Analysis of Machine Learning Based Imputation of Missing Data

IF 1.1 Zone 4 Computer Science Q3 COMPUTER SCIENCE, CYBERNETICS

Cybernetics and Systems Pub Date : 2023-09-09 DOI:10.1080/01969722.2023.2247257

Syed Tahir Hussain Rizvi, Muhammad Yasir Latif, Muhammad Saad Amin, Achraf Jabeur Telmoudi, Nasir Ali Shah


Citations: 0

Abstract



Data analysis and classification can be affected by the presence of missing data in datasets. To deal with missing data, either deletion- or imputation-based methods are used, which result in a reduction of data records or the imputation of incorrectly predicted values, respectively. The quality of imputed data can be significantly improved if missing values are generated accurately using machine learning algorithms. In this work, an analysis of machine learning-based algorithms for missing data imputation is performed. The K-nearest neighbors (KNN) and Sequential KNN (SKNN) algorithms are used to impute missing values in datasets. Datasets whose missing values are handled using a statistical deletion approach (list-wise deletion (LD)) and ML-based imputation methods (KNN and SKNN) are then tested and compared using different ML classifiers (Support Vector Machine (SVM) and Decision Tree (DT)) to evaluate the effectiveness of the imputed data. The algorithms are compared in terms of accuracy, and the results show that the ML-based imputation method SKNN outperforms both the LD-based approach and the KNN method in handling missing data on almost every dataset with both classification algorithms (SVM and DT).
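The three ways of handling missing values named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give their distance measure, the value of k, or the exact SKNN variant, so the sketch assumes Euclidean distance over shared observed features, mean aggregation over neighbours, and the common description of SKNN in which rows with the fewest gaps are filled first and then reused as donors. The function names (`listwise_delete`, `knn_impute`, `sknn_impute`) and the toy data are hypothetical, and at least one complete row is assumed to exist. The classifier comparison (SVM and DT) is not covered here.

```python
import math

def listwise_delete(rows):
    """List-wise deletion: keep only rows with no missing (None) entries."""
    return [r for r in rows if None not in r]

def _distance(a, b):
    """Euclidean distance over the features observed in both rows (assumed metric)."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

def _fill(row, donors, k):
    """Fill each gap in `row` with the mean of that feature over the k nearest donors."""
    nearest = sorted(donors, key=lambda d: _distance(row, d))[:k]
    return [
        sum(d[j] for d in nearest) / len(nearest) if v is None else v
        for j, v in enumerate(row)
    ]

def knn_impute(rows, k=2):
    """Plain KNN: incomplete rows borrow only from the originally complete rows."""
    donors = listwise_delete(rows)
    return [_fill(r, donors, k) if None in r else list(r) for r in rows]

def sknn_impute(rows, k=2):
    """Sequential KNN: rows with the fewest gaps are filled first, and each
    filled row joins the donor pool used for the rows that follow."""
    donors = listwise_delete(rows)
    result = [list(r) for r in rows]
    order = sorted((i for i, r in enumerate(rows) if None in r),
                   key=lambda i: rows[i].count(None))
    for i in order:
        result[i] = _fill(rows[i], donors, k)
        donors.append(result[i])
    return result

# Toy data (hypothetical): three complete rows, two rows with one gap each.
rows = [
    [1.0, 2.0],
    [2.0, 3.0],
    [10.0, 11.0],
    [9.0, None],
    [None, 10.5],
]

print(listwise_delete(rows))   # the two incomplete rows are dropped
print(knn_impute(rows)[4])     # [6.0, 10.5]
print(sknn_impute(rows)[4])    # [9.5, 10.5]
```

On this toy input, list-wise deletion discards two of the five records, while both imputers keep all five. The last row shows how the two imputers can diverge: plain KNN averages over the complete rows only, whereas the sequential variant can also draw on the row it filled one step earlier.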


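Source journal

Cybernetics and Systems (Engineering and Technology: Computer Science, Cybernetics)

CiteScore: 4.30

Self-citation rate: 5.90%

Articles published: not listed

Review time: >12 weeks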

About the journal: Cybernetics and Systems aims to share the latest developments in cybernetics and systems with a global audience of academics working or interested in these areas. We bring together scientists from diverse disciplines and update them on important cybernetic and systems methods, while drawing attention to novel useful applications of these methods to problems from all areas of research, in the humanities, in the sciences and the technical disciplines. Showing a direct or likely benefit of the result(s) of the paper to humankind is welcome but not a prerequisite. We welcome original research that:
- Improves methods of cybernetics, systems theory and systems research
- Improves methods in complexity research
- Shows novel useful applications of cybernetics and/or systems methods to problems in one or more areas in the humanities
- Shows novel useful applications of cybernetics and/or systems methods to problems in one or more scientific disciplines
- Shows novel useful applications of cybernetics and/or systems methods to technical problems
- Shows novel applications in the arts