A Survey on Classifying Big Data with Label Noise

Impact factor: 2.9; JCR Q3; Computer Science, Information Systems

ACM Journal of Data and Information Quality. Pub Date: 2022-04-10. DOI: 10.1145/3492546

Justin M. Johnson, T. Khoshgoftaar


Citations: 10

Abstract

Class label noise is a critical component of data quality that directly inhibits the predictive performance of machine learning algorithms. While many data-level and algorithm-level methods exist for treating label noise, the challenges associated with big data call for new and improved methods. This survey addresses these concerns by providing an extensive literature review on treating label noise within big data. We begin with an introduction to the class label noise problem and traditional methods for treating label noise. Next, we present 30 methods for treating class label noise in a range of big data contexts, i.e., high-volume, high-variety, and high-velocity problems. The surveyed works include distributed solutions capable of operating on datasets of arbitrary sizes, deep learning techniques for large-scale datasets with limited clean labels, and streaming techniques for detecting class noise in the presence of concept drift. Common trends and best practices are identified in each of these areas, implementation details are reviewed, empirical results are compared across studies when applicable, and references to 17 open-source projects and programming packages are provided. An emphasis on label noise challenges, solutions, and empirical results as they relate to big data distinguishes this work as a unique contribution that will inspire future research and guide machine learning practitioners.


Source journal

ACM Journal of Data and Information Quality (Computer Science, Information Systems)

CiteScore: 4.10

Self-citation rate: 4.80%

Articles published