Training-ValueNet: Data Driven Label Noise Cleaning on Weakly-Supervised Web Images

Luka Smyth, D. Kangin, N. Pugeault
{"title":"Training-ValueNet: Data Driven Label Noise Cleaning on Weakly-Supervised Web Images","authors":"Luka Smyth, D. Kangin, N. Pugeault","doi":"10.1109/DEVLRN.2019.8850689","DOIUrl":null,"url":null,"abstract":"Manually labelling new datasets for image classification remains expensive and time-consuming. A promising alternative is to utilize the abundance of images on the web for which search queries or surrounding text offers a natural source of weak supervision. Unfortunately the label noise in these datasets has limited their use in practice. Several methods have been proposed for performing unsupervised label noise cleaning, the majority of which use outlier detection to identify and remove mislabeled images. In this paper, we argue that outlier detection is an inherently unsuitable approach for this task due to major flaws in the assumptions it makes about the distribution of mislabeled images. We propose an alternative approach which makes no such assumptions. Rather than looking for outliers, we observe that mislabeled images can be identified by the detrimental impact they have on the performance of an image classifier. We introduce training-value as an objective measure of the contribution each training example makes to the validation loss. We then present the training-value approximation network (Training-ValueNet) which learns a mapping between each image and its training-value. 
We demonstrate that by simply discarding images with a negative training-value, Training-ValueNet is able to significantly improve classification performance on a held-out test set, outperforming the state of the art in outlier detection by a large margin.","PeriodicalId":318973,"journal":{"name":"2019 Joint IEEE 9th International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 Joint IEEE 9th International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DEVLRN.2019.8850689","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 4

Abstract

Manually labelling new datasets for image classification remains expensive and time-consuming. A promising alternative is to utilize the abundance of images on the web, for which search queries or surrounding text offer a natural source of weak supervision. Unfortunately, the label noise in these datasets has limited their use in practice. Several methods have been proposed for performing unsupervised label noise cleaning, the majority of which use outlier detection to identify and remove mislabeled images. In this paper, we argue that outlier detection is an inherently unsuitable approach for this task due to major flaws in the assumptions it makes about the distribution of mislabeled images. We propose an alternative approach that makes no such assumptions. Rather than looking for outliers, we observe that mislabeled images can be identified by the detrimental impact they have on the performance of an image classifier. We introduce training-value as an objective measure of the contribution each training example makes to the validation loss. We then present the training-value approximation network (Training-ValueNet), which learns a mapping between each image and its training-value. We demonstrate that by simply discarding images with a negative training-value, Training-ValueNet is able to significantly improve classification performance on a held-out test set, outperforming the state of the art in outlier detection by a large margin.
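The core idea in the abstract — score each training example by its contribution to the validation loss, then discard examples with a negative score — can be illustrated with a minimal sketch. This is not the paper's Training-ValueNet (which learns to approximate these scores with a network); it is a toy estimate on a logistic-regression model, where an example's training-value is taken as the drop in validation loss from a single SGD step on that example. All names (`make_blobs`, `training_value`, the noise-injection setup) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_blobs(n):
    """Toy binary classification data: two 2-D Gaussian blobs."""
    x0 = rng.normal(-1.0, 0.5, size=(n, 2))
    x1 = rng.normal(+1.0, 0.5, size=(n, 2))
    return np.vstack([x0, x1]), np.array([0] * n + [1] * n)

X_train, y_train = make_blobs(50)
X_val, y_val = make_blobs(50)

# Inject label noise: flip the labels of the first 10 training examples.
y_train[:10] = 1 - y_train[:10]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def val_loss(w, b):
    """Mean cross-entropy of the linear model on the validation set."""
    p = sigmoid(X_val @ w + b)
    return -np.mean(y_val * np.log(p + 1e-12)
                    + (1 - y_val) * np.log(1 - p + 1e-12))

def training_value(i, w, b, lr=0.5):
    """Decrease in validation loss after one SGD step on example i.

    Positive: the example helped generalization.
    Negative: training on it hurt the validation loss
    (the signature of a mislabeled example)."""
    p = sigmoid(X_train[i] @ w + b)
    g = p - y_train[i]                 # gradient of the logistic loss
    w2, b2 = w - lr * g * X_train[i], b - lr * g
    return val_loss(w, b) - val_loss(w2, b2)

w, b = np.zeros(2), 0.0
values = np.array([training_value(i, w, b) for i in range(len(y_train))])

# Cleaning rule from the abstract: keep only non-negative training-value.
keep = values >= 0
```

In this toy setup the flipped examples pull the decision boundary the wrong way, so their estimated training-value is negative and the `keep` mask filters most of them out. The paper's contribution is learning this mapping from image to training-value with a network, so that scores averaged over the course of training (rather than a single step from one parameter state, as here) can be predicted cheaply for every example.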