{"title":"A shadow-based framework for label noise detection and data quality enhancement","authors":"Wanwan Zheng","doi":"10.1016/j.dajour.2025.100588","DOIUrl":null,"url":null,"abstract":"<div><div>Machine learning algorithms are typically evaluated using benchmark datasets under the assumption that these datasets are clean. However, recent studies have revealed the presence of label noise in many benchmark datasets, indicating a biased evaluation to date. Confident learning (CL), an emerging method for noise detection, has been regarded a higher priority than the development of new learning algorithms. Although CL is promoted as applicable to various types of data, existing research has largely concentrated on its application to large-scale datasets. Given that many domains handle datasets of more modest size, this study proposed a shadow-based framework for label noise detection called ShadowN, and conducted a comprehensive comparison with CL using six smaller datasets. Four key aspects were examined: the number of detected noises, the distribution of assigned noise scores, the improvement in classification accuracy, and the accuracy of noise detection with artificial noise injection. The results indicated that ShadowN achieved the highest overall classification accuracy and demonstrated superior precision and F-score across all noise levels. 
While the current implementation of ShadowN is limited to binary classification, our findings underscore its practical value and demonstrate its potential for enhancing data quality in real-world machine learning workflows.</div></div>","PeriodicalId":100357,"journal":{"name":"Decision Analytics Journal","volume":"15 ","pages":"Article 100588"},"PeriodicalIF":0.0000,"publicationDate":"2025-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Decision Analytics Journal","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S277266222500044X","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Machine learning algorithms are typically evaluated on benchmark datasets under the assumption that these datasets are clean. However, recent studies have revealed label noise in many benchmark datasets, suggesting that evaluations to date have been biased. Confident learning (CL), an emerging method for noise detection, has been regarded as a higher priority than the development of new learning algorithms. Although CL is promoted as applicable to various types of data, existing research has largely concentrated on its application to large-scale datasets. Given that many domains handle datasets of more modest size, this study proposed a shadow-based framework for label noise detection, called ShadowN, and conducted a comprehensive comparison with CL on six smaller datasets. Four key aspects were examined: the number of detected noisy labels, the distribution of assigned noise scores, the improvement in classification accuracy, and the accuracy of noise detection under artificial noise injection. The results indicated that ShadowN achieved the highest overall classification accuracy and demonstrated superior precision and F-score across all noise levels. While the current implementation of ShadowN is limited to binary classification, our findings underscore its practical value and demonstrate its potential for enhancing data quality in real-world machine learning workflows.
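The evaluation protocol described above — inject artificial label noise at a known rate, run a detector, and score it by precision and F-score against the known flipped labels — can be illustrated with a small self-contained sketch. ShadowN itself is not reproduced here (the paper's implementation is not given in this abstract); as a hypothetical stand-in detector, the sketch flags binary-labeled samples whose nearest neighbours mostly disagree with their label. The dataset, the k-NN scoring rule, and the 0.5 flagging threshold are all illustrative assumptions, not the paper's method.

```python
import random

def knn_noise_scores(X, y, k=5):
    """Illustrative detector (not ShadowN): score each sample by the
    fraction of its k nearest neighbours (1-D Euclidean distance)
    that carry a different label. High scores suggest mislabeling."""
    scores = []
    for i, xi in enumerate(X):
        nearest = sorted((abs(xi - xj), j) for j, xj in enumerate(X) if j != i)
        top = [j for _, j in nearest[:k]]
        scores.append(sum(y[j] != y[i] for j in top) / k)
    return scores

def inject_noise(y, rate, rng):
    """Artificial noise injection: flip `rate` of the binary labels;
    return the noisy labels and the set of flipped indices."""
    idx = rng.sample(range(len(y)), int(rate * len(y)))
    noisy = list(y)
    for i in idx:
        noisy[i] = 1 - noisy[i]
    return noisy, set(idx)

def detection_metrics(flagged, truly_noisy):
    """Precision, recall, and F-score of the flagged set against the
    ground-truth flipped set."""
    tp = len(flagged & truly_noisy)
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(truly_noisy) if truly_noisy else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

rng = random.Random(0)
# Two well-separated 1-D clusters define a clean binary problem.
X = [rng.gauss(0, 1) for _ in range(100)] + [rng.gauss(8, 1) for _ in range(100)]
y = [0] * 100 + [1] * 100
noisy_y, flipped = inject_noise(y, rate=0.10, rng=rng)

scores = knn_noise_scores(X, noisy_y, k=5)
# Flag samples whose neighbourhood mostly disagrees with their label.
flagged = {i for i, s in enumerate(scores) if s > 0.5}
precision, recall, f1 = detection_metrics(flagged, flipped)
```

On data this cleanly separated, most flipped labels stand out against their neighbourhood, so precision and recall are both high; the same harness could score any detector (CL, ShadowN, or otherwise) by swapping in its noise scores.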