An optimal imputation algorithm for reducing bias and errors in missing data handling for AI models

Decision Analytics Journal, Volume 16, Article 100627. Pub Date: 2025-09-01. DOI: 10.1016/j.dajour.2025.100627

Anu Maria Sebastian, David Peter, Rinu Ann Sebastian


Citations: 0

Abstract


An optimal imputation algorithm for reducing bias and errors in missing data handling for AI models

Data fuels the machine learning (ML) algorithms that underlie artificial intelligence (AI). Missing data is common in most real-world datasets because of measurement errors, non-responses, and human errors during data collection, and it can ultimately reduce the accuracy and reliability of AI models. Moreover, many ML algorithms are designed to work with complete datasets. Data imputation (DI) helps create a comprehensive representation of the data, allowing AI models to learn from a richer dataset and generate more accurate results. Choosing a proper imputation technique is therefore essential to minimizing the errors and biases introduced into the data during imputation. It is difficult to create a single imputation method that performs optimally across the entire spectrum of data, because datasets differ in their inherent characteristics. All existing DI selection approaches are computationally intensive: they demand repetitive, exhaustive experimentation with the popular DI methods on every new dataset to evaluate their suitability, which wastes significant time and effort.

This research proposes an algorithm for systematically selecting an optimal imputation technique based on the intrinsic characteristics of the dataset. It associates the performance of DI algorithms with the specific characteristics of a given dataset using a characteristics chart (C-chart). The resulting DI recommendation remains valid for another dataset with a similar C-chart. Thus, our method eliminates the need for exhaustive experimentation to find the proper DI method and offers reliable imputation for real-world datasets that lack a verifiable ground truth. We demonstrate the performance of our method using a suite of six benchmark DI algorithms, eight public datasets, and two ML classifiers. We use both Normalized Root Mean Square Error (NRMSE) and Jensen-Shannon Distance (JSD) scores to evaluate the DI algorithms. We observed that the recommended DI algorithms improved ML classifier accuracy by up to 19.8%. We believe the proposed algorithm is a significant step towards automating the selection of an optimal DI technique based on data characteristics.
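The abstract names NRMSE and JSD as its evaluation scores but does not give their formulas. The following is a minimal sketch of how the two scores might be computed for a single numeric column, using only the Python standard library. Several choices here are assumptions rather than the authors' definitions: NRMSE is normalized by the range of the true values (other conventions divide by the standard deviation or the mean), JSD is computed over equal-width histograms with base-2 logarithms, and the function names `nrmse`, `histogram`, `js_distance`, and `score_imputation` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def nrmse(true_vals: Sequence[float], imputed_vals: Sequence[float]) -> float:
    """RMSE between true and imputed values, divided by the range of the true values.

    Range normalization is an assumption; the source does not state which
    normalization it uses.
    """
    if len(true_vals) != len(imputed_vals) or len(true_vals) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((t - i) ** 2 for t, i in zip(true_vals, imputed_vals)) / len(true_vals)
    value_range = max(true_vals) - min(true_vals)
    if value_range == 0:
        raise ValueError("true values have zero range; NRMSE is undefined")
    return math.sqrt(mse) / value_range


def histogram(values: Sequence[float], lo: float, hi: float, bins: int) -> List[float]:
    """Equal-width histogram over [lo, hi], returned as probabilities."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # If all values are identical (width == 0), everything goes in bin 0.
        idx = 0 if width == 0 else min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def js_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon distance: square root of the JS divergence (base-2 logs).

    With base-2 logs the result lies in [0, 1]; 0 means identical distributions.
    """
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x: Sequence[float], y: Sequence[float]) -> float:
        # Terms with x == 0 contribute nothing; y > 0 wherever x > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = (kl(p, m) + kl(q, m)) / 2
    return math.sqrt(max(0.0, divergence))


def score_imputation(
    true_vals: Sequence[float], imputed_vals: Sequence[float], bins: int = 10
) -> Tuple[float, float]:
    """Return (NRMSE, JSD) for one column of held-out true vs. imputed values."""
    lo = min(min(true_vals), min(imputed_vals))
    hi = max(max(true_vals), max(imputed_vals))
    p = histogram(true_vals, lo, hi, bins)
    q = histogram(imputed_vals, lo, hi, bins)
    return nrmse(true_vals, imputed_vals), js_distance(p, q)


if __name__ == "__main__":
    held_out = [1.0, 2.0, 3.0, 4.0, 5.0]   # values masked on purpose
    mean_imputed = [3.0] * 5               # what mean imputation would fill in
    print(score_imputation(held_out, mean_imputed, bins=4))
```

The two scores are complementary: NRMSE measures how far each imputed value is from the value it replaced, while JSD compares the overall shape of the imputed distribution with the true one. Both require values that were masked deliberately, so that the true values are known.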

