Effective and efficient handling of missing data in supervised machine learning
Peter Ayokunle Popoola, Jules-Raymond Tapamo, Alain Guy Honoré Assounga
Data Science and Management, Volume 8, Issue 3 (September 2025), Pages 361–373
DOI: 10.1016/j.dsm.2024.12.002
https://www.sciencedirect.com/science/article/pii/S2666764924000663
Abstract
The prevailing consensus in the statistical literature is that multiple imputation is generally the most suitable method for addressing missing data in statistical analyses, whereas complete case analysis is deemed appropriate only when the rate of missingness is negligible or when the missingness mechanism is missing completely at random (MCAR). This study investigates the applicability of this consensus in the context of supervised machine learning, with particular emphasis on the interactions between the imputation method, missingness mechanism, and missingness rate. Furthermore, we examine the time efficiency of these "state-of-the-art" imputation methods, given the time-sensitive nature of certain machine learning applications. Using ten real-world datasets, we introduce missingness at rates ranging from approximately 5% to 75% under the MCAR, missing at random (MAR), and missing not at random (MNAR) mechanisms. We then address the missing data using five methods: complete case analysis (CCA), mean imputation, hot deck imputation, regression imputation, and multiple imputation (MI). Statistical tests are conducted on the machine learning outcomes, and the findings are presented and analyzed. Our investigation reveals that in nearly all scenarios CCA performs comparably to MI, even with substantial levels of missingness under the MAR and MNAR conditions and with missingness in the output variable for regression problems; under some conditions, CCA outperforms MI. Thus, given the considerable computational demands of MI, CCA is recommended within the broader context of supervised machine learning, particularly in big-data environments.
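The sketch below (not the authors' code) illustrates the kind of experimental pipeline the abstract describes: introduce MCAR missingness at a chosen rate, then compare complete case analysis, mean imputation, and a multiple-imputation-style method (scikit-learn's IterativeImputer standing in for MI) on a downstream classifier. The dataset, missingness rate, model, and the choice to mean-impute the CCA test set are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch only; the paper uses ten real-world datasets, rates of ~5%-75%,
# all three missingness mechanisms, and five handling methods.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)  # assumed dataset, for illustration only

def add_mcar(X, rate):
    """Set each cell to NaN independently with probability `rate` (MCAR)."""
    X = X.copy()
    X[rng.random(X.shape) < rate] = np.nan
    return X

X_miss = add_mcar(X, rate=0.05)  # low cell-wise rate so some complete rows remain
X_tr, X_te, y_tr, y_te = train_test_split(X_miss, y, test_size=0.3, random_state=0)

def evaluate(X_tr, X_te, y_tr, y_te):
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

# Complete case analysis: drop every training row that contains a missing value.
keep = ~np.isnan(X_tr).any(axis=1)
# Test rows with NaNs still need values at prediction time; here we mean-impute them (an assumption).
test_filler = SimpleImputer(strategy="mean").fit(X_tr[keep])
print("CCA  accuracy:", evaluate(X_tr[keep], test_filler.transform(X_te), y_tr[keep], y_te))

# Mean imputation and an iterative (MICE-style) imputer, both fit on training data only.
for name, imp in [("mean", SimpleImputer(strategy="mean")),
                  ("MICE", IterativeImputer(max_iter=10, random_state=0))]:
    imp.fit(X_tr)
    print(f"{name} accuracy:", evaluate(imp.transform(X_tr), imp.transform(X_te), y_tr, y_te))
```

The same loop can be repeated across missingness rates and mechanisms to reproduce, in spirit, the comparison of handling methods and their runtime cost that the study reports.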