A Semiparametric Multiple Imputation Approach to Fully Synthetic Data for Complex Surveys

Year of Publication	2022
Author	Mandi Yu, Yulei He, Trivellore Raghunathan
Journal	Journal of Survey Statistics and Methodology
Volume	10
Issue	3
Pages	618–641
Abstract	Data synthesis is an effective statistical approach for reducing data disclosure risk. Generating fully synthetic data might minimize such risk, but its modeling and application can be difficult for data from large, complex surveys. This article extended two-stage imputation to simultaneously impute item missing values and generate fully synthetic data. A new combining rule for making inferences using data generated in this manner was developed. Two semiparametric missing data imputation models were adapted to generate fully synthetic data for skewed continuous variables and sparse binary variables, respectively. The proposed approach was evaluated using simulated data and real longitudinal data from the Health and Retirement Study. Using real data, the proposed approach was also compared with two existing synthesis approaches: (1) parametric regression models as implemented in IVEware; and (2) nonparametric Classification and Regression Trees as implemented in the synthpop package for R. The results show that the proposed strategy maintains high data utility for a wide variety of descriptive and model-based statistics. The proposed strategy also performs better than existing methods for sophisticated analyses such as factor analysis.
DOI	https://doi.org/10.1093/jssam/smac016