Effective and efficient handling of missing data in supervised machine learning
Peter Ayokunle Popoola, Jules-Raymond Tapamo, Alain Guy Honoré Assounga
Data Science and Management, Volume 8, Issue 3 (September 2025), Pages 361–373
DOI: 10.1016/j.dsm.2024.12.002
https://www.sciencedirect.com/science/article/pii/S2666764924000663
Abstract
The prevailing consensus in the statistical literature is that multiple imputation is generally the most suitable method for addressing missing data in statistical analyses, whereas complete case analysis is deemed appropriate only when the rate of missingness is negligible or when the missingness mechanism is missing completely at random (MCAR). This study investigates the applicability of this consensus in the context of supervised machine learning, with particular emphasis on the interactions between the imputation method, missingness mechanism, and missingness rate. Furthermore, we examine the time efficiency of these "state-of-the-art" imputation methods, given the time-sensitive nature of certain machine learning applications. Using ten real-world datasets, we introduce missingness at rates ranging from approximately 5% to 75% under the MCAR, missing at random (MAR), and missing not at random (MNAR) mechanisms. We then address the missing data using five methods: complete case analysis (CCA), mean imputation, hot deck imputation, regression imputation, and multiple imputation (MI). Statistical tests are conducted on the machine learning outcomes, and the findings are presented and analyzed. Our investigation reveals that in nearly all scenarios CCA performs comparably to MI, even with substantial levels of missingness under the MAR and MNAR conditions and with missingness in the output variable for regression problems; under some conditions, CCA outperforms MI. Thus, given the considerable computational demands of MI, CCA is recommended within the broader context of supervised machine learning, particularly in big-data environments.
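The sketch below (not the authors' code) illustrates the kind of experimental pipeline the abstract describes: introduce MCAR missingness at a chosen rate, then compare complete case analysis, mean imputation, and a multiple-imputation-style method (scikit-learn's IterativeImputer standing in for MI) on a downstream classifier. The dataset, missingness rate, model, and the choice to mean-impute the CCA test set are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch only; the paper uses ten real-world datasets, rates of ~5%-75%,
# all three missingness mechanisms, and five handling methods.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)  # assumed dataset, for illustration only

def add_mcar(X, rate):
    """Set each cell to NaN independently with probability `rate` (MCAR)."""
    X = X.copy()
    X[rng.random(X.shape) < rate] = np.nan
    return X

X_miss = add_mcar(X, rate=0.05)  # low cell-wise rate so some complete rows remain
X_tr, X_te, y_tr, y_te = train_test_split(X_miss, y, test_size=0.3, random_state=0)

def evaluate(X_tr, X_te, y_tr, y_te):
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

# Complete case analysis: drop every training row that contains a missing value.
keep = ~np.isnan(X_tr).any(axis=1)
# Test rows with NaNs still need values at prediction time; here we mean-impute them (an assumption).
test_filler = SimpleImputer(strategy="mean").fit(X_tr[keep])
print("CCA  accuracy:", evaluate(X_tr[keep], test_filler.transform(X_te), y_tr[keep], y_te))

# Mean imputation and an iterative (MICE-style) imputer, both fit on training data only.
for name, imp in [("mean", SimpleImputer(strategy="mean")),
                  ("MICE", IterativeImputer(max_iter=10, random_state=0))]:
    imp.fit(X_tr)
    print(f"{name} accuracy:", evaluate(imp.transform(X_tr), imp.transform(X_te), y_tr, y_te))
```

The same loop can be repeated across missingness rates and mechanisms to reproduce, in spirit, the comparison of handling methods and their runtime cost that the study reports.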