Abstract
Retrospective clinical datasets are often characterized by a relatively small sample size and many missing values. A common way of handling missingness is to discard patients with missing covariates from the analysis, which further reduces the sample size. Alternatively, if the mechanism that generated the missing data allows, incomplete data can be imputed on the basis of the observed data, which avoids reducing the sample size and allows methods that require complete data to be applied afterwards. Moreover, imputation methodologies might depend on the particular purpose and might achieve better results by considering specific characteristics of the domain. The problem of missing data treatment is studied here in the context of survival tree analysis for the estimation of a prognostic patient stratification. Survival tree methods usually address this problem with surrogate splits, that is, splitting rules that use other variables and yield results similar to those of the original splits. Instead, our methodology models the dependencies among the clinical variables with a Bayesian network, which is then used to perform data imputation, thus allowing the survival tree to be applied to the completed dataset. The Bayesian network is learned directly from the incomplete data using a structural expectation-maximization (EM) procedure in which the maximization step is performed with an exact anytime method, so that the only source of approximation is the EM formulation itself. On both simulated and real data, our proposed methodology usually outperformed several existing methods for data imputation, and the resulting imputation improved the stratification estimated by the survival tree (especially compared with using surrogate splits).
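To make the imputation idea concrete, the following is a minimal sketch of EM-based imputation for a toy network with two binary variables, A -> B. It is not the procedure of the paper: the structure is fixed rather than learned, so there is no structural EM and no exact anytime maximization step, and the function names (`em_fit`, `impute`) and the toy records are assumptions made purely for illustration. With only two variables and unrestricted probability tables, the network A -> B is equivalent to the full joint table, which is what the code estimates.

```python
from itertools import product

def consistent(a_obs, b_obs):
    """All completions (a, b) that agree with the observed entries (None = missing)."""
    return [(a, b) for a, b in product((0, 1), repeat=2)
            if a_obs in (None, a) and b_obs in (None, b)]

def em_fit(records, n_iter=50):
    """Estimate the joint P(A, B) from incomplete records with plain EM."""
    joint = [[0.25, 0.25], [0.25, 0.25]]  # uniform starting point
    for _ in range(n_iter):
        # E-step: spread each record over its consistent completions,
        # weighted by the current model.
        counts = [[0.0, 0.0], [0.0, 0.0]]
        for a_obs, b_obs in records:
            comps = consistent(a_obs, b_obs)
            z = sum(joint[a][b] for a, b in comps)
            for a, b in comps:
                counts[a][b] += joint[a][b] / z
        # M-step: maximum-likelihood estimate from the expected counts.
        n = len(records)
        joint = [[counts[a][b] / n for b in (0, 1)] for a in (0, 1)]
    return joint

def impute(records, joint):
    """Fill each record with its most probable consistent completion."""
    return [max(consistent(a_obs, b_obs), key=lambda ab: joint[ab[0]][ab[1]])
            for a_obs, b_obs in records]

# Hypothetical toy data: A and B mostly agree; four records have a missing entry.
records = ([(0, 0)] * 4 + [(1, 1)] * 4 + [(0, 1), (1, 0)]
           + [(0, None), (1, None), (None, 1), (None, 0)])
joint = em_fit(records)
completed = impute(records, joint)
```

On these toy records, each missing entry is filled with the value that agrees with the observed variable, and fully observed records are left unchanged. A completed dataset of this kind is what a downstream method requiring complete data, such as a survival tree, would then receive.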
Original language | English |
---|---|
Pages (from-to) | 373-387 |
Number of pages | 15 |
Journal | Computational Statistics and Data Analysis |
Volume | 93 |
DOIs | |
Publication status | Published - 1 Jan 2016 |
Externally published | Yes |
Funding
The work of C. P. de Campos was mostly done while he was affiliated with the Dalle Molle Institute for Artificial Intelligence and the Institute of Oncology Research. The work was partially supported by a research grant from the Ente Ospedaliero Cantonale (EOC), Bellinzona, Switzerland; Oncosuisse (OCS-02034-02-2007, OCS-1939-8-2006); and Swiss NSF grants Nos. 200021_146606/1 and 200020_137680/1.
Keywords
- Bayesian networks
- Data imputation
- Missing data
- Prognostic stratification
- Survival tree