A shadow-based framework for label noise detection and data quality enhancement

Decision Analytics Journal. Pub Date: 2025-06-01. DOI: 10.1016/j.dajour.2025.100588

Wanwan Zheng

Volume 15, Article 100588. Full text: https://www.sciencedirect.com/science/article/pii/S277266222500044X

Citations: 0

Abstract

Machine learning algorithms are typically evaluated on benchmark datasets under the assumption that these datasets are clean. However, recent studies have revealed label noise in many benchmark datasets, which suggests that evaluations to date have been biased. Confident learning (CL), an emerging method for noise detection, has been regarded as a higher priority than the development of new learning algorithms. Although CL is promoted as applicable to various types of data, existing research has largely concentrated on its application to large-scale datasets. Given that many domains handle datasets of more modest size, this study proposed a shadow-based framework for label noise detection called ShadowN and conducted a comprehensive comparison with CL on six smaller datasets. Four aspects were examined: the number of noisy labels detected, the distribution of assigned noise scores, the improvement in classification accuracy, and the accuracy of noise detection under artificial noise injection. The results indicated that ShadowN achieved the highest overall classification accuracy and demonstrated superior precision and F-score across all noise levels. While the current implementation of ShadowN is limited to binary classification, the findings underscore its practical value and demonstrate its potential for enhancing data quality in real-world machine learning workflows.
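The fourth aspect, detection accuracy under artificial noise injection, can be illustrated with a minimal sketch. This is not ShadowN or CL: the abstract does not describe either algorithm's internals, so the detector is left out entirely. The function names `inject_label_noise` and `detection_scores`, the uniform random flipping scheme, and the fixed seed are all assumptions made for illustration; the sketch only shows how flipped binary labels can serve as ground truth for scoring whatever indices a detector flags.

```python
import random


def inject_label_noise(labels, noise_rate, seed=0):
    """Flip a fixed fraction of binary (0/1) labels.

    Returns the noisy labels and the set of flipped indices,
    which act as ground truth for evaluating a noise detector.
    Uniform random flipping is an assumption of this sketch.
    """
    rng = random.Random(seed)
    n_flip = round(noise_rate * len(labels))
    flipped = set(rng.sample(range(len(labels)), n_flip))
    noisy = [1 - y if i in flipped else y for i, y in enumerate(labels)]
    return noisy, flipped


def detection_scores(flagged, flipped):
    """Precision, recall and F-score of flagged indices against flipped ones."""
    flagged, flipped = set(flagged), set(flipped)
    true_pos = len(flagged & flipped)
    precision = true_pos / len(flagged) if flagged else 0.0
    recall = true_pos / len(flipped) if flipped else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


if __name__ == "__main__":
    clean = [0, 1] * 50                       # toy binary labels
    noisy, flipped = inject_label_noise(clean, noise_rate=0.1)
    # A real detector would be run on (features, noisy); here a
    # hypothetical detector output is faked for demonstration only.
    flagged = sorted(flipped)[:8] + [i for i in range(100) if i not in flipped][:2]
    print(detection_scores(flagged, flipped))
```

Repeating this over several noise rates gives the kind of per-noise-level precision and F-score comparison the abstract refers to, with the caveat that the paper's actual injection protocol may differ.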


Source journal

Decision Analytics Journal

CiteScore

3.90

Self-citation rate

0.00%

Number of publications