Guest blog by Louis Phillips.
I’m a Second Year Computer Science undergraduate student at University College London, with a keen interest in Machine and Statistical Learning. LinkedIn: https://www.linkedin.com/in/louis-phillips-03756711a/.
The project I created for the FHIR hackathon is a mathematically driven library that, given a small input dataset, generates a DataFrame of synthetic patient records preserving the statistical properties of the original data.
The statistical approach is an important demonstration of how to generate synthetic data from a small initial dataset, and on datasets of this size it will generally outperform the widely used (but data-hungry) deep learning techniques.
Theoretical Solution:
Firstly, we must frame the problem appropriately. Formally, suppose we have an unknown density f(x | θ), where θ is the parameter vector of the distribution and the function gives the density of the random variable X at the point x. We have a sample drawn from this density with some parameter vector θ_s, from which we wish to obtain an unbiased estimate θ̂ of the parameters (so that E[θ̂] = θ_s), and then generate additional data Ŷ distributed with parameters θ̂.
Let M, μ and Σ be our data vectors, their mean vector, and their positive-definite covariance matrix respectively. We need to generate a sample whose population characteristics match the moments of M. To achieve this, our sample Ŷ should satisfy Ŷ = Lz + μ, where z is a draw from a standard multivariate normal distribution of the required shape, and L is the Cholesky factor of Σ, so that LL^T = Σ. The resulting Ŷ, distributed as f(y | μ, Σ), then has the desired population characteristics, since Cov(Lz + μ) = L·Cov(z)·L^T = LL^T = Σ and E[Lz + μ] = μ.
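As an illustration, here is a minimal NumPy sketch of this sampling step. It is not the library's actual API; the function name and signature are invented for the example.

```python
import numpy as np

def synthesize(data, n_samples, seed=None):
    """Generate synthetic rows matching the sample mean and covariance of
    `data` (an n_observations x n_features array)."""
    rng = np.random.default_rng(seed)
    mu = data.mean(axis=0)                 # sample mean vector
    sigma = np.cov(data, rowvar=False)     # sample covariance matrix (must be positive definite)
    L = np.linalg.cholesky(sigma)          # lower-triangular L with L @ L.T == sigma
    z = rng.standard_normal((n_samples, data.shape[1]))  # standard multivariate normal draws
    return z @ L.T + mu                    # row-wise Y = L z + mu
```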
The library built during the FHIR hackathon is simply an implementation of this theory. It includes code to clean the data returned by FHIR into a tidy array format, plus final "sanity check" adjustments to the generated values, such as rounding impossible negative values up to 0 and omitting values that clearly do not fit the initial distribution, before wrapping the output in a DataFrame.
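As a rough sketch of the kind of post-processing described above (the function, its signature and the column names are all hypothetical; the library's actual clean-up rules live in the repository):

```python
import pandas as pd

def to_patient_frame(synthetic, columns, non_negative=("age", "heart_rate")):
    """Wrap a raw synthetic array in a DataFrame and apply simple sanity
    checks. Column names here are purely illustrative."""
    df = pd.DataFrame(synthetic, columns=columns)
    for col in non_negative:
        if col in df.columns:
            # Negative values are impossible for these features; clamp them to 0.
            df[col] = df[col].clip(lower=0)
    return df
```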
The code can be found on GitHub at https://github.com/hiraphor/FHIRworks_2020. The repository includes a Python file and a Jupyter notebook with the implementation described above.
Here is a visual example of four random features plotted against each other. The four plots on the left are drawn from the original sample provided; the four on the right are from data generated synthetically by the library.
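The figure itself is not reproduced in this post, but a comparison of this kind can be drawn with pandas' scatter_matrix helper. The snippet below is only an illustrative sketch: the feature names and the toy "original" sample are invented, and the synthetic rows come from the synthesize sketch above rather than from the library itself.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(0)
cols = ["age", "height", "weight", "heart_rate"]   # hypothetical feature names

# Toy stand-in for the original FHIR-derived sample (correlated features).
cov = np.array([[225.,  30.,  40.,  10.],
                [ 30., 100.,  60.,   5.],
                [ 40.,  60., 144.,   8.],
                [ 10.,   5.,   8.,  64.]])
original = pd.DataFrame(
    rng.multivariate_normal([50, 170, 75, 70], cov, size=200), columns=cols)

# Synthetic rows produced with the Cholesky-based sketch from earlier.
synthetic = pd.DataFrame(
    synthesize(original.to_numpy(), n_samples=200, seed=1), columns=cols)

# One scatter matrix per dataset, so the pairwise structure can be compared.
for frame, title in [(original, "Original sample"), (synthetic, "Synthetic sample")]:
    scatter_matrix(frame, diagonal="hist", figsize=(6, 6))
    plt.suptitle(title)
plt.show()
```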
Learning Points:
- Attempts with standard deep learning approaches to generating synthetic data produced high-error, overfitted data that clearly did not follow the underlying distribution.
- Standard out-of-the-box machine learning algorithms are often ineffective on smaller datasets; to get an accurate fit, domain knowledge or feature engineering is a must.
Future Work:
The major issue with this type of statistical data synthesis is its inability to capture certain non-linearities in the relationships between data vectors. Given more time, exciting progress could be made by using machine learning tools to augment the synthesis. For example, data vectors could be mapped through kernel functions into higher-dimensional spaces where linear separation could be performed via support vector machines, and thus support vector regression could be used as a supporting tool. Alternatively, function approximators such as neural networks could be used to regress the non-linear relationships, taking synthetically generated data as inputs; this would remove the need to generate the entire dataset synthetically while eliminating the non-linearity issue.
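None of this has been implemented yet, but as a rough sketch of the first idea (all data, feature roles and hyperparameters below are invented for illustration), scikit-learn's SVR with an RBF kernel could learn a non-linear relationship from the small real sample and then supply the dependent feature for synthetically generated inputs:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Toy "real" data with a non-linear relationship that the linear-Gaussian
# synthesis cannot capture: y depends quadratically on x.
x_real = rng.uniform(0, 10, size=(200, 1))
y_real = 0.5 * x_real.ravel() ** 2 + rng.normal(0, 2, size=200)

# Kernel SVR learns the non-linear mapping from the small real sample.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.5)
svr.fit(x_real, y_real)

# Only the input feature is generated synthetically (e.g. with the Cholesky
# approach); the regressor then supplies the non-linearly related feature.
x_synth = rng.uniform(0, 10, size=(50, 1))
y_synth = svr.predict(x_synth)
```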