Anu Maria Sebastian, David Peter, Rinu Ann Sebastian
Decision Analytics Journal, volume 16, Article 100627. Published 2025-09-01. DOI: 10.1016/j.dajour.2025.100627. https://www.sciencedirect.com/science/article/pii/S2772662225000839
An optimal imputation algorithm for reducing bias and errors in missing data handling for AI models
Data fuels the machine learning (ML) algorithms that underlie artificial intelligence (AI). Missing data is common in most real-world datasets because of measurement errors, non-responses, and human errors during data collection, and it can ultimately reduce the accuracy and reliability of AI models. Moreover, many ML algorithms are designed to work with complete datasets. Data imputation (DI) helps create a more complete representation of the data, allowing AI models to learn from a richer dataset and generate more accurate results. Choosing the proper imputation technique is therefore essential to minimizing the errors and biases that imputation introduces into the data. It is difficult to create an imputation method that performs optimally across the entire spectrum of data because different datasets display disparate inherent characteristics. All existing DI selection approaches are computationally intensive: they demand repetitive and exhaustive experimentation with the popular DI methods on every new dataset to evaluate their suitability, wasting significant time and effort. This research proposes an algorithm for systematically selecting an optimal imputation technique based on the intrinsic characteristics of the dataset. It associates the performance of DI algorithms with the specific characteristics of a given dataset using a characteristics chart (C-chart). The resulting DI recommendation remains valid for any other dataset with a similar C-chart. Thus, our method eliminates the need for exhaustive experimentation to find the proper DI method and offers reliable imputation for real-world datasets that lack a verifiable ground truth. We demonstrate the performance of our method using a suite of six benchmark DI algorithms, eight public datasets, and two ML classifiers. We use both Normalized Root Mean Square Error (NRMSE) and Jensen-Shannon Distance (JSD) scores to evaluate the DI algorithms. We observe that the recommended DI algorithms improve ML classifier accuracy by up to 19.8%. We believe that the proposed algorithm is a significant step towards automating the selection of an optimal DI technique based on data characteristics.
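As an illustration of the two evaluation scores named in the abstract, the following is a minimal sketch and not the authors' implementation. The abstract does not specify how the scores are computed, so several choices here are assumptions: NRMSE is normalized by the range of the true values (other conventions divide by the standard deviation or the mean), JSD is taken as the square root of the base-2 Jensen-Shannon divergence between equal-width histograms, and the function names `nrmse`, `histogram`, and `js_distance` are hypothetical.

```python
import math


def nrmse(true_vals, imputed_vals):
    """RMSE between true and imputed values, divided by the range of the
    true values. The range normalization is an assumption of this sketch."""
    if not true_vals or len(true_vals) != len(imputed_vals):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((t - i) ** 2 for t, i in zip(true_vals, imputed_vals)) / len(true_vals)
    value_range = max(true_vals) - min(true_vals)
    if value_range == 0:
        raise ValueError("range of true values is zero; NRMSE is undefined")
    return math.sqrt(mse) / value_range


def histogram(values, bins, lo, hi):
    """Relative frequencies of `values` over `bins` equal-width bins on [lo, hi].
    Values outside [lo, hi] are clamped into the first or last bin."""
    if not values or bins < 1 or hi <= lo:
        raise ValueError("need non-empty values, bins >= 1, and hi > lo")
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width)
        counts[max(0, min(idx, bins - 1))] += 1
    return [c / len(values) for c in counts]


def js_distance(p, q):
    """Jensen-Shannon distance between two discrete distributions:
    the square root of the base-2 JS divergence, so it lies in [0, 1]."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with zero mass contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(0.0, divergence))  # guard against tiny negative rounding


if __name__ == "__main__":
    # Toy column: the true values versus a mean-imputed version of them.
    true_col = [1.0, 2.0, 3.0, 4.0, 5.0]
    mean_imputed = [3.0] * 5
    print("NRMSE:", nrmse(true_col, mean_imputed))

    p = histogram(true_col, bins=2, lo=1.0, hi=5.0)
    q = histogram(mean_imputed, bins=2, lo=1.0, hi=5.0)
    print("JSD:", js_distance(p, q))
```

NRMSE measures point-wise error and therefore needs the true values of the masked entries, whereas JSD compares the shape of the imputed distribution with that of the reference distribution. Reporting the two together gives complementary views of imputation quality.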