Probability density imputation of missing data with Gaussian Mixture Models


A method for filling in missing measurements in a data set, under some statistical assumptions, by drawing from the conditional density of the data as modelled by a Bayesian mixture of Gaussians.

Multiple imputation (Rubin, 2004) is a procedure for conducting unbiased data analysis of an incomplete data set X, using complete-data analysis techniques. It consists of generating m imputed data sets in which the missing values Xmis have been replaced by plausible values, analysing each imputed data set, and combining the results. For this approach to be sound, the plausible values are assumed to be drawn from the posterior distribution of the missing values given the data, p(Xmis | X). Typically, m ≤ 10 imputed data sets are used.
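
The "combining the results" step is commonly done with Rubin's rules. The following is a minimal sketch of that pooling step, assuming each of the m analyses yields a point estimate and its variance; the function name pool_rubin and its arguments are illustrative and do not come from this work.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool m complete-data estimates with Rubin's rules.

    estimates: the m point estimates, one per imputed data set.
    variances: the m estimated variances of those point estimates.
    Returns the pooled estimate and its total variance.
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = estimates.shape[0]
    pooled = estimates.mean(axis=0)            # average of the m estimates
    within = variances.mean(axis=0)            # average within-imputation variance
    between = estimates.var(axis=0, ddof=1)    # variance between the m estimates
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total

# Hypothetical numbers, for illustration only.
pooled, total = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what carries the uncertainty due to the missing values; it is inflated by (1 + 1/m) because only finitely many imputations are used.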

In this thesis we go one step further, and set m → ∞. That is, we attempt to estimate the distribution p(Xmis | X) with a closed-form probability distribution, performing probability density imputation. To this end, we provide the Partially Observed Infinite Gaussian Mixture Model (POIGMM), an algorithm for (1) density estimation from incomplete data sets, and (2) density imputation. It is based on the Infinite GMM by Blei and Jordan (2006).
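
To illustrate why a Gaussian mixture admits a closed-form imputed density, the sketch below conditions an already fitted mixture on the observed coordinates of a single point. It uses the standard Gaussian conditioning identities and is not the POIGMM inference procedure itself; the function name conditional_gmm, the use of NaN to mark missing entries, and the assumption that the point has at least one observed and one missing coordinate are all illustrative choices.

```python
import numpy as np

def conditional_gmm(x, weights, means, covs):
    """Parameters of p(x_mis | x_obs) under a Gaussian mixture.

    x: 1-D array with NaN at the missing coordinates.
    weights, means, covs: parameters of the fitted mixture (assumed given).
    Returns the weights, means and covariances of the conditional mixture.
    """
    x = np.asarray(x, dtype=float)
    mis = np.isnan(x)
    obs = ~mis
    d_obs = int(obs.sum())
    log_w, c_means, c_covs = [], [], []
    for pi_k, mu, sigma in zip(weights, means, covs):
        mu = np.asarray(mu, dtype=float)
        sigma = np.asarray(sigma, dtype=float)
        s_oo = sigma[np.ix_(obs, obs)]
        s_mo = sigma[np.ix_(mis, obs)]
        s_mm = sigma[np.ix_(mis, mis)]
        diff = x[obs] - mu[obs]
        solved = np.linalg.solve(s_oo, diff)      # inv(s_oo) @ diff
        gain = np.linalg.solve(s_oo, s_mo.T).T    # s_mo @ inv(s_oo)
        c_means.append(mu[mis] + s_mo @ solved)   # conditional mean
        c_covs.append(s_mm - gain @ s_mo.T)       # conditional covariance
        # Log of pi_k times the marginal Gaussian density of the observed part.
        _, logdet = np.linalg.slogdet(s_oo)
        log_w.append(np.log(pi_k)
                     - 0.5 * (d_obs * np.log(2 * np.pi) + logdet + diff @ solved))
    log_w = np.array(log_w)
    w = np.exp(log_w - log_w.max())               # normalise in a stable way
    return w / w.sum(), np.array(c_means), np.array(c_covs)

# Hypothetical one-component example: the second coordinate is missing.
w, mu_c, cov_c = conditional_gmm(
    [1.0, np.nan], [1.0], [[0.0, 0.0]], [[[1.0, 0.5], [0.5, 1.0]]]
)
```

The result is again a Gaussian mixture over the missing coordinates only, which is what makes the imputed density available in closed form.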

A density-imputed data set can easily be reduced to a multiple-imputed data set by taking m samples from the estimated posterior distribution. Accordingly, we compare the POIGMM with the state of the art in multiple imputation: MissForest (Stekhoven and Bühlmann, 2011) and MICE (Van Buuren and Oudshoorn, 1999).
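
The reduction from a density-imputed point to m imputed copies can be sketched as follows, assuming the imputed density for that point is available as a Gaussian mixture over its missing coordinates. The function name sample_imputations, the NaN convention for missing entries, and the seed argument are illustrative assumptions.

```python
import numpy as np

def sample_imputations(x, weights, means, covs, m, seed=0):
    """Draw m completed copies of x from a mixture over its missing entries.

    x: 1-D array with NaN at the missing coordinates.
    weights, means, covs: mixture parameters over the missing coordinates only.
    Returns an array of shape (m, len(x)) with no missing values.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    mis = np.isnan(x)
    # Pick one mixture component per imputation, then sample from it.
    components = rng.choice(len(weights), size=m, p=weights)
    imputed = np.tile(x, (m, 1))
    for i, k in enumerate(components):
        imputed[i, mis] = rng.multivariate_normal(means[k], covs[k])
    return imputed

# Hypothetical one-component density over a single missing coordinate.
copies = sample_imputations(
    [1.0, np.nan], [1.0], np.array([[0.5]]), np.array([[[0.75]]]), m=5
)
```

Observed coordinates are copied unchanged into every imputation; only the missing ones vary between copies.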
