TY - JOUR
T1 - Missing Data Imputation with High-Dimensional Data
AU - Brini, Alberto
AU - van den Heuvel, Edwin R.
PY - 2024
Y1 - 2024
N2 - Imputation of missing data in high-dimensional datasets with more variables P than samples N is hampered by the data dimensionality. For multivariate imputation, the covariance matrix is ill-conditioned and cannot be properly estimated. For fully conditional imputation, the regression models for imputation cannot include all the variables. Thus, the high dimension requires special imputation approaches. In this article, we provide an overview and realistic comparisons of imputation approaches for high-dimensional data when applied to a linear mixed modeling (LMM) framework. We examine approaches from three different classes using simulation studies: multiple imputation with penalized regression; multiple imputation with recursive partitioning and predictive mean matching; and multiple imputation with Principal Component Analysis (PCA). We illustrate the methods on a real case study where a multivariate outcome (i.e., an extracted set of correlated biomarkers from human urine samples) was collected and monitored over time, and we discuss the proposed methods alongside more standard imputation techniques that could be applied by ignoring either the multivariate or the longitudinal dimension. Our simulations demonstrate the superiority of the recursive partitioning and predictive mean matching algorithm over the other methods in terms of bias, mean squared error, and coverage of the LMM parameter estimates when compared to those obtained from a data analysis without missingness, although it comes at the expense of high computational costs. It is worthwhile to reconsider much faster methodologies, such as the one relying on PCA.
AB - Imputation of missing data in high-dimensional datasets with more variables P than samples N is hampered by the data dimensionality. For multivariate imputation, the covariance matrix is ill-conditioned and cannot be properly estimated. For fully conditional imputation, the regression models for imputation cannot include all the variables. Thus, the high dimension requires special imputation approaches. In this article, we provide an overview and realistic comparisons of imputation approaches for high-dimensional data when applied to a linear mixed modeling (LMM) framework. We examine approaches from three different classes using simulation studies: multiple imputation with penalized regression; multiple imputation with recursive partitioning and predictive mean matching; and multiple imputation with Principal Component Analysis (PCA). We illustrate the methods on a real case study where a multivariate outcome (i.e., an extracted set of correlated biomarkers from human urine samples) was collected and monitored over time, and we discuss the proposed methods alongside more standard imputation techniques that could be applied by ignoring either the multivariate or the longitudinal dimension. Our simulations demonstrate the superiority of the recursive partitioning and predictive mean matching algorithm over the other methods in terms of bias, mean squared error, and coverage of the LMM parameter estimates when compared to those obtained from a data analysis without missingness, although it comes at the expense of high computational costs. It is worthwhile to reconsider much faster methodologies, such as the one relying on PCA.
KW - High-dimensional data
KW - Linear mixed models
KW - Longitudinal data
KW - Missing data
KW - Multiple imputation
KW - Penalized regression
KW - Principal component analysis
KW - Recursive partitioning
UR - http://www.scopus.com/inward/record.url?scp=85177094246&partnerID=8YFLogxK
U2 - 10.1080/00031305.2023.2259962
DO - 10.1080/00031305.2023.2259962
M3 - Article
AN - SCOPUS:85177094246
SN - 0003-1305
VL - 78
SP - 240
EP - 252
JO - American Statistician
JF - American Statistician
IS - 2
ER -