A Survey on Classifying Big Data with Label Noise

Impact factor 1.5; JCR Q3 (Computer Science, Information Systems)
Justin M. Johnson, T. Khoshgoftaar
DOI: 10.1145/3492546 (https://doi.org/10.1145/3492546)
Journal: ACM Journal of Data and Information Quality, vol. 22 1 (as listed), pp. 1-43
Published: 2022-04-10 (journal article)
Citations: 10

Abstract

Class label noise is a critical component of data quality that directly inhibits the predictive performance of machine learning algorithms. While many data-level and algorithm-level methods exist for treating label noise, the challenges associated with big data call for new and improved methods. This survey addresses these concerns by providing an extensive literature review on treating label noise within big data. We begin with an introduction to the class label noise problem and traditional methods for treating label noise. Next, we present 30 methods for treating class label noise in a range of big data contexts, i.e., high-volume, high-variety, and high-velocity problems. The surveyed works include distributed solutions capable of operating on datasets of arbitrary sizes, deep learning techniques for large-scale datasets with limited clean labels, and streaming techniques for detecting class noise in the presence of concept drift. Common trends and best practices are identified in each of these areas, implementation details are reviewed, empirical results are compared across studies when applicable, and references to 17 open-source projects and programming packages are provided. An emphasis on label noise challenges, solutions, and empirical results as they relate to big data distinguishes this work as a unique contribution that will inspire future research and guide machine learning practitioners.
Source journal: ACM Journal of Data and Information Quality (Computer Science, Information Systems)
CiteScore: 4.10
Self-citation rate: 4.80%
Articles published: 0