基于标签分布相似性的众包噪声校正

IF 4.6 3区计算机科学 Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Frontiers of Computer Science Pub Date : 2023-12-23 DOI:10.1007/s11704-023-2751-3

Lijuan Ren, Liangxiao Jiang, Wenjun Zhang, Chaoqun Li

{"title":"基于标签分布相似性的众包噪声校正","authors":"Lijuan Ren, Liangxiao Jiang, Wenjun Zhang, Chaoqun Li","doi":"10.1007/s11704-023-2751-3","DOIUrl":null,"url":null,"abstract":"<h3>Abstract</h3> <p>In crowdsourcing scenarios, we can obtain each instance’s multiple noisy labels from different crowd workers and then infer its integrated label via label aggregation. In spite of the effectiveness of label aggregation methods, there still remains a certain level of noise in the integrated labels. Thus, some noise correction methods have been proposed to reduce the impact of noise in recent years. However, to the best of our knowledge, existing methods rarely consider an instance’s information from both its features and multiple noisy labels simultaneously when identifying a noise instance. In this study, we argue that the more distinguishable an instance’s features but the noisier its multiple noisy labels, the more likely it is a noise instance. Based on this premise, we propose a label distribution similarity-based noise correction (LDSNC) method. To measure whether an instance’s features are distinguishable, we obtain each instance’s predicted label distribution by building multiple classifiers using instances’ features and their integrated labels. To measure whether an instance’s multiple noisy labels are noisy, we obtain each instance’s multiple noisy label distribution using its multiple noisy labels. Then, we use the Kullback-Leibler (KL) divergence to calculate the similarity between the predicted label distribution and multiple noisy label distribution and define the instance with the lower similarity as a noise instance. The extensive experimental results on 34 simulated and four real-world crowdsourced datasets validate the effectiveness of our method.</p>","PeriodicalId":12640,"journal":{"name":"Frontiers of Computer Science","volume":"10 1","pages":""},"PeriodicalIF":4.6000,"publicationDate":"2023-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Label distribution similarity-based noise correction for crowdsourcing\",\"authors\":\"Lijuan Ren, Liangxiao Jiang, Wenjun Zhang, Chaoqun Li\",\"doi\":\"10.1007/s11704-023-2751-3\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<h3>Abstract</h3> <p>In crowdsourcing scenarios, we can obtain each instance’s multiple noisy labels from different crowd workers and then infer its integrated label via label aggregation. In spite of the effectiveness of label aggregation methods, there still remains a certain level of noise in the integrated labels. Thus, some noise correction methods have been proposed to reduce the impact of noise in recent years. However, to the best of our knowledge, existing methods rarely consider an instance’s information from both its features and multiple noisy labels simultaneously when identifying a noise instance. In this study, we argue that the more distinguishable an instance’s features but the noisier its multiple noisy labels, the more likely it is a noise instance. Based on this premise, we propose a label distribution similarity-based noise correction (LDSNC) method. To measure whether an instance’s features are distinguishable, we obtain each instance’s predicted label distribution by building multiple classifiers using instances’ features and their integrated labels. To measure whether an instance’s multiple noisy labels are noisy, we obtain each instance’s multiple noisy label distribution using its multiple noisy labels. Then, we use the Kullback-Leibler (KL) divergence to calculate the similarity between the predicted label distribution and multiple noisy label distribution and define the instance with the lower similarity as a noise instance. The extensive experimental results on 34 simulated and four real-world crowdsourced datasets validate the effectiveness of our method.</p>\",\"PeriodicalId\":12640,\"journal\":{\"name\":\"Frontiers of Computer Science\",\"volume\":\"10 1\",\"pages\":\"\"},\"PeriodicalIF\":4.6000,\"publicationDate\":\"2023-12-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Frontiers of Computer Science\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1007/s11704-023-2751-3\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers of Computer Science","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s11704-023-2751-3","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

摘要

摘要在众包场景中，我们可以从不同的众包工作者那里获得每个实例的多个噪声标签，然后通过标签聚合推断其综合标签。尽管标签聚合方法很有效，但整合后的标签仍存在一定程度的噪声。因此，近年来人们提出了一些噪声校正方法来减少噪声的影响。然而，据我们所知，现有的方法在识别噪声实例时很少同时考虑实例的特征信息和多个噪声标签的信息。在本研究中，我们认为，一个实例的特征越明显，但其多个噪声标签越嘈杂，它就越有可能是噪声实例。基于这一前提，我们提出了基于标签分布相似性的噪声校正（LDSNC）方法。为了衡量实例的特征是否可区分，我们利用实例的特征及其集成标签建立多个分类器，从而获得每个实例的预测标签分布。为了衡量一个实例的多重噪声标签是否有噪声，我们使用实例的多重噪声标签获得每个实例的多重噪声标签分布。然后，我们使用库尔巴克-莱伯勒（KL）发散计算预测标签分布与多重噪声标签分布之间的相似度，并将相似度较低的实例定义为噪声实例。在 34 个模拟数据集和 4 个真实世界众包数据集上的大量实验结果验证了我们方法的有效性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Label distribution similarity-based noise correction for crowdsourcing

Abstract

In crowdsourcing scenarios, we can obtain each instance’s multiple noisy labels from different crowd workers and then infer its integrated label via label aggregation. In spite of the effectiveness of label aggregation methods, there still remains a certain level of noise in the integrated labels. Thus, some noise correction methods have been proposed to reduce the impact of noise in recent years. However, to the best of our knowledge, existing methods rarely consider an instance’s information from both its features and multiple noisy labels simultaneously when identifying a noise instance. In this study, we argue that the more distinguishable an instance’s features but the noisier its multiple noisy labels, the more likely it is a noise instance. Based on this premise, we propose a label distribution similarity-based noise correction (LDSNC) method. To measure whether an instance’s features are distinguishable, we obtain each instance’s predicted label distribution by building multiple classifiers using instances’ features and their integrated labels. To measure whether an instance’s multiple noisy labels are noisy, we obtain each instance’s multiple noisy label distribution using its multiple noisy labels. Then, we use the Kullback-Leibler (KL) divergence to calculate the similarity between the predicted label distribution and multiple noisy label distribution and define the instance with the lower similarity as a noise instance. The extensive experimental results on 34 simulated and four real-world crowdsourced datasets validate the effectiveness of our method.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Frontiers of Computer Science COMPUTER SCIENCE, INFORMATION SYSTEMS-COMPUTER SCIENCE, SOFTWARE ENGINEERING

CiteScore

8.60

自引率

2.40%

发文量

799

审稿时长

6-12 weeks

期刊介绍： Frontiers of Computer Science aims to provide a forum for the publication of peer-reviewed papers to promote rapid communication and exchange between computer scientists. The journal publishes research papers and review articles in a wide range of topics, including: architecture, software, artificial intelligence, theoretical computer science, networks and communication, information systems, multimedia and graphics, information security, interdisciplinary, etc. The journal especially encourages papers from new emerging and multidisciplinary areas, as well as papers reflecting the international trends of research and development and on special topics reporting progress made by Chinese computer scientists.