Hierarchical Crowdsourcing for Data Labeling with Heterogeneous Crowd

Haodi Zhang, Wenxi Huang, Zhenhan Su, Junyang Chen, Di Jiang, Lixin Fan, Chen Zhang, Defu Lian, Kaishun Wu
2023 IEEE 39th International Conference on Data Engineering (ICDE), April 2023.
DOI: 10.1109/ICDE55515.2023.00099

Abstract

With the rapid, continuous development of data-driven technologies such as supervised learning, many applications require high-quality labeled data sets. Because small tasks can be crowdsourced easily and at low cost, a straightforward way to improve label quality is to collect multiple labels from a crowd and then aggregate the answers. Aggregation strategies include majority voting and its many variants, EM-based approaches, graph neural networks, and so on. However, due to the loss of uncertainty information and the task correlations that commonly exist, the aggregated labels usually contain errors and may harm downstream model training. To address this problem, we propose a hierarchical crowdsourcing framework for data labeling with noisy answers over correlated data. We exploit the heterogeneity of the labeling crowd and form an initialization-checking-update loop to improve the quality of the labeled data. We formalize and solve the core optimization problem: selecting a proper set of checking tasks for each round. We prove that maximizing the expected quality improvement is equivalent to minimizing the conditional entropy of the observations given the crowdsourced answer families for the selected task set, which is NP-hard. We therefore design an efficient approximation algorithm and conduct a series of experiments on real data. The experimental results show that the proposed method effectively improves the quality of the labeled data sets as well as the state-of-the-art (SOTA) performance, without extra human labor cost.
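The majority-voting baseline that the abstract contrasts against can be sketched in a few lines: each task receives several crowd labels, and the aggregated label is simply the most frequent answer. This is an illustrative sketch; the function and variable names are not from the paper.

```python
from collections import Counter

def majority_vote(crowd_labels):
    """Aggregate a list of crowd answers for one task by plurality."""
    counts = Counter(crowd_labels)
    label, _ = counts.most_common(1)[0]
    return label

# Example: three workers label the same task
print(majority_vote(["cat", "dog", "cat"]))  # -> cat
```

As the abstract notes, this aggregation discards the uncertainty information carried by the disagreement among workers, which motivates the checking loop.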
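The entropy-guided selection idea can be illustrated with a hedged sketch: tasks whose crowd answers are most uncertain (highest Shannon entropy over the empirical answer distribution) are the most valuable to route to expert checkers. This greedy heuristic is only an illustration of the intuition, not the paper's approximation algorithm, and all names below are hypothetical.

```python
import math
from collections import Counter

def answer_entropy(crowd_labels):
    """Shannon entropy (in bits) of the empirical answer distribution."""
    counts = Counter(crowd_labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def select_checking_tasks(answers_by_task, budget):
    """Greedily pick the `budget` most uncertain tasks for checking."""
    ranked = sorted(answers_by_task,
                    key=lambda t: answer_entropy(answers_by_task[t]),
                    reverse=True)
    return ranked[:budget]

answers = {
    "t1": ["A", "A", "A"],       # unanimous -> entropy 0 bits
    "t2": ["A", "B", "A", "B"],  # evenly split -> entropy 1 bit
    "t3": ["A", "A", "B"],       # mild disagreement -> ~0.918 bits
}
print(select_checking_tasks(answers, 1))  # -> ['t2']
```

The paper's actual objective is joint: minimize the conditional entropy of the observations given the answer families of the whole selected set, which couples correlated tasks and makes the problem NP-hard, whereas this per-task ranking treats tasks independently.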