Benchmarking machine learning methods for the identification of mislabeled data

IF 13.9 · CAS Zone 2 (Computer Science) · JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Lusine Nazaretyan, Ulf Leser, Martin Kircher
{"title":"Benchmarking machine learning methods for the identification of mislabeled data","authors":"Lusine Nazaretyan,&nbsp;Ulf Leser,&nbsp;Martin Kircher","doi":"10.1007/s10462-025-11293-9","DOIUrl":null,"url":null,"abstract":"<div><p>Supervised machine learning recently gained growing importance in various fields of research. To train reliable models, data scientists need credible data, which is not always available. A particularly hard and widespread problem deteriorating the performance of methods are mislabeled samples (Northcutt in J Artif Intell Res 70:1373-1411, 2021). Common sources of mislabeling are weakly defined classes, labels that change their meaning, unsuitable annotators, or ambiguous guidelines for labeling. Because mislabeling lowers prediction quality, it is essential for scientists to be able to identify wrong labels before actually starting the learning process. For that, numerous algorithms for the identification of noisy instances have been developed. However, so far, a comprehensive empirical comparison of available methods has been missing.</p><p>In this paper, we survey and benchmark methods for the identification of mislabeled samples in tabular data. We discuss the theoretical background of label noise and how it can lead to mislabeling, review categorizations of identification methods, and briefly introduce 34 specific approaches together with popular data sets. Finally, 20 selected methods are benchmarked using artificially blurred data with controllable mislabeling and a new real-life genomic dataset with known errors. We compare methods varying the amount and the type of noise, as well as the sample size and domain of data. We find that most of the methods have the highest performance on datasets with a noise level of around 20-30% where the best filters identify around 80% of the noisy instances with relatively high precision (0.58<span>\\(-\\)</span>0.65). Acquiring precise predictions seems to be a more challenging task than identifying most of the noisy instances: while the average recall score over all models ranges from 0.48 to 0.77, the average precision score ranges from 0.16 to 0.55. Furthermore, none of the methods excels over all others in isolation, while ensemble-based methods often outperform individual models. We provide all data sets and analysis code to enable a better handling of mislabeled data and give recommendations on usage of noise filters depending on various dataset parameters.</p></div>","PeriodicalId":8449,"journal":{"name":"Artificial Intelligence Review","volume":"58 10","pages":""},"PeriodicalIF":13.9000,"publicationDate":"2025-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s10462-025-11293-9.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Artificial Intelligence Review","FirstCategoryId":"94","ListUrlMain":"https://link.springer.com/article/10.1007/s10462-025-11293-9","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Supervised machine learning has recently gained importance in various fields of research. To train reliable models, data scientists need credible data, which is not always available. A particularly hard and widespread problem that deteriorates the performance of methods is mislabeled samples (Northcutt, J Artif Intell Res 70:1373-1411, 2021). Common sources of mislabeling are weakly defined classes, labels that change their meaning, unsuitable annotators, or ambiguous labeling guidelines. Because mislabeling lowers prediction quality, it is essential for scientists to be able to identify wrong labels before starting the learning process. To this end, numerous algorithms for the identification of noisy instances have been developed. However, a comprehensive empirical comparison of the available methods has so far been missing.

In this paper, we survey and benchmark methods for the identification of mislabeled samples in tabular data. We discuss the theoretical background of label noise and how it can lead to mislabeling, review categorizations of identification methods, and briefly introduce 34 specific approaches together with popular data sets. Finally, 20 selected methods are benchmarked using artificially blurred data with controllable mislabeling and a new real-life genomic dataset with known errors. We compare methods while varying the amount and type of noise, as well as the sample size and data domain. We find that most of the methods perform best on datasets with a noise level of around 20-30%, where the best filters identify around 80% of the noisy instances with relatively high precision (0.58-0.65). Acquiring precise predictions seems to be a more challenging task than identifying most of the noisy instances: while the average recall score over all models ranges from 0.48 to 0.77, the average precision score ranges from 0.16 to 0.55. Furthermore, none of the methods excels over all others in isolation, while ensemble-based methods often outperform individual models. We provide all data sets and analysis code to enable a better handling of mislabeled data, and we give recommendations on the use of noise filters depending on various dataset parameters.
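To make the evaluation concrete, the following is a minimal sketch, not the paper's exact pipeline, of the kind of experiment described above: symmetric label noise is injected at a controllable rate into synthetic tabular data, a simple cross-validation "classification filter" flags instances whose out-of-fold prediction disagrees with their observed label, and the filter is scored with precision and recall against the known noise. The choice of RandomForestClassifier, the noise rate, and all dataset parameters are illustrative assumptions, not the benchmark's settings.

```python
# Hedged sketch of a noise-filter evaluation: inject controllable label
# noise, flag suspects with a cross-validation filter, score the filter.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Synthetic tabular data with clean binary labels (illustrative sizes).
X, y_clean = make_classification(n_samples=2000, n_features=20,
                                 n_informative=10, random_state=0)

# Uniform ("symmetric") label noise at a controllable rate, mirroring the
# 20-30% regime discussed in the abstract.
noise_rate = 0.25
n_noisy = int(noise_rate * len(y_clean))
noisy_idx = rng.choice(len(y_clean), size=n_noisy, replace=False)
y_noisy = y_clean.copy()
y_noisy[noisy_idx] = 1 - y_noisy[noisy_idx]  # flip the binary labels

# Classification filter: an instance is flagged as mislabeled when the
# out-of-fold prediction of a model trained on the noisy labels disagrees
# with its observed label.
y_pred = cross_val_predict(RandomForestClassifier(random_state=0),
                           X, y_noisy, cv=5)
flagged = y_pred != y_noisy

# Evaluate the filter against the known ground-truth noise.
truly_noisy = np.zeros(len(y_clean), dtype=bool)
truly_noisy[noisy_idx] = True
precision = (flagged & truly_noisy).sum() / max(flagged.sum(), 1)
recall = (flagged & truly_noisy).sum() / truly_noisy.sum()
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Replacing the single model with a majority or consensus vote over several different classifiers turns this into the kind of ensemble-based filter that, according to the abstract, often outperforms individual models.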

Source journal
Artificial Intelligence Review (Engineering & Technology - Computer Science: Artificial Intelligence)
CiteScore: 22.00
Self-citation rate: 3.30%
Articles per year: 194
Review time: 5.3 months
About the journal: Artificial Intelligence Review, a fully open access journal, publishes cutting-edge research in artificial intelligence and cognitive science. It features critical evaluations of applications, techniques, and algorithms, providing a platform for both researchers and application developers. The journal includes refereed survey and tutorial articles, along with reviews and commentary on significant developments in the field.