Training-ValueNet: Data Driven Label Noise Cleaning on Weakly-Supervised Web Images

Luka Smyth, D. Kangin, N. Pugeault
{"title":"Training-ValueNet: Data Driven Label Noise Cleaning on Weakly-Supervised Web Images","authors":"Luka Smyth, D. Kangin, N. Pugeault","doi":"10.1109/DEVLRN.2019.8850689","DOIUrl":null,"url":null,"abstract":"Manually labelling new datasets for image classification remains expensive and time-consuming. A promising alternative is to utilize the abundance of images on the web for which search queries or surrounding text offers a natural source of weak supervision. Unfortunately the label noise in these datasets has limited their use in practice. Several methods have been proposed for performing unsupervised label noise cleaning, the majority of which use outlier detection to identify and remove mislabeled images. In this paper, we argue that outlier detection is an inherently unsuitable approach for this task due to major flaws in the assumptions it makes about the distribution of mislabeled images. We propose an alternative approach which makes no such assumptions. Rather than looking for outliers, we observe that mislabeled images can be identified by the detrimental impact they have on the performance of an image classifier. We introduce training-value as an objective measure of the contribution each training example makes to the validation loss. We then present the training-value approximation network (Training-ValueNet) which learns a mapping between each image and its training-value. 
We demonstrate that by simply discarding images with a negative training-value, Training-ValueNet is able to significantly improve classification performance on a held-out test set, outperforming the state of the art in outlier detection by a large margin.","PeriodicalId":318973,"journal":{"name":"2019 Joint IEEE 9th International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 Joint IEEE 9th International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DEVLRN.2019.8850689","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 4

Abstract

Manually labelling new datasets for image classification remains expensive and time-consuming. A promising alternative is to utilize the abundance of images on the web, for which search queries or surrounding text offer a natural source of weak supervision. Unfortunately, the label noise in these datasets has limited their use in practice. Several methods have been proposed for performing unsupervised label noise cleaning, the majority of which use outlier detection to identify and remove mislabeled images. In this paper, we argue that outlier detection is an inherently unsuitable approach for this task due to major flaws in the assumptions it makes about the distribution of mislabeled images. We propose an alternative approach that makes no such assumptions. Rather than looking for outliers, we observe that mislabeled images can be identified by the detrimental impact they have on the performance of an image classifier. We introduce training-value as an objective measure of the contribution each training example makes to the validation loss. We then present the training-value approximation network (Training-ValueNet), which learns a mapping between each image and its training-value. We demonstrate that by simply discarding images with a negative training-value, Training-ValueNet is able to significantly improve classification performance on a held-out test set, outperforming the state of the art in outlier detection by a large margin.
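The core idea in the abstract — score each training example by its contribution to the validation loss, then discard examples with a negative score — can be illustrated with a minimal sketch. This is not the paper's Training-ValueNet (which learns to approximate these scores with a network); it is a toy estimate on a logistic-regression model, where an example's training-value is taken as the drop in validation loss from a single SGD step on that example. All names (`make_blobs`, `training_value`, the noise-injection setup) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_blobs(n):
    """Toy binary classification data: two 2-D Gaussian blobs."""
    x0 = rng.normal(-1.0, 0.5, size=(n, 2))
    x1 = rng.normal(+1.0, 0.5, size=(n, 2))
    return np.vstack([x0, x1]), np.array([0] * n + [1] * n)

X_train, y_train = make_blobs(50)
X_val, y_val = make_blobs(50)

# Inject label noise: flip the labels of the first 10 training examples.
y_train[:10] = 1 - y_train[:10]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def val_loss(w, b):
    """Mean cross-entropy of the linear model on the validation set."""
    p = sigmoid(X_val @ w + b)
    return -np.mean(y_val * np.log(p + 1e-12)
                    + (1 - y_val) * np.log(1 - p + 1e-12))

def training_value(i, w, b, lr=0.5):
    """Decrease in validation loss after one SGD step on example i.

    Positive: the example helped generalization.
    Negative: training on it hurt the validation loss
    (the signature of a mislabeled example)."""
    p = sigmoid(X_train[i] @ w + b)
    g = p - y_train[i]                 # gradient of the logistic loss
    w2, b2 = w - lr * g * X_train[i], b - lr * g
    return val_loss(w, b) - val_loss(w2, b2)

w, b = np.zeros(2), 0.0
values = np.array([training_value(i, w, b) for i in range(len(y_train))])

# Cleaning rule from the abstract: keep only non-negative training-value.
keep = values >= 0
```

In this toy setup the flipped examples pull the decision boundary the wrong way, so their estimated training-value is negative and the `keep` mask filters most of them out. The paper's contribution is learning this mapping from image to training-value with a network, so that scores averaged over the course of training (rather than a single step from one parameter state, as here) can be predicted cheaply for every example.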