Active Content-Based Crowdsourcing Task Selection

Proceedings of the 25th ACM International on Conference on Information and Knowledge Management Pub Date : 2016-10-24 DOI:10.1145/2983323.2983716

P. Bansal, Carsten Eickhoff, Thomas Hofmann

{"title":"Active Content-Based Crowdsourcing Task Selection","authors":"P. Bansal, Carsten Eickhoff, Thomas Hofmann","doi":"10.1145/2983323.2983716","DOIUrl":null,"url":null,"abstract":"Crowdsourcing has long established itself as a viable alternative to corpus annotation by domain experts for tasks such as document relevance assessment. The crowdsourcing process traditionally relies on high degrees of label redundancy in order to mitigate the detrimental effects of individually noisy worker submissions. Such redundancy comes at the cost of increased label volume, and, subsequently, monetary requirements. In practice, especially as the size of datasets increases, this is undesirable. In this paper, we focus on an alternate method that exploits document information instead, to infer relevance labels for unjudged documents. We present an active learning scheme for document selection that aims at maximising the overall relevance label prediction accuracy, for a given budget of available relevance judgements by exploiting system-wide estimates of label variance and mutual information. Our experiments are based on TREC 2011 Crowdsourcing Track data and show that our method is able to achieve state-of-the-art performance while requiring 17% - 25% less budget.","PeriodicalId":250808,"journal":{"name":"Proceedings of the 25th ACM International on Conference on Information and Knowledge Management","volume":"10 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"12","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 25th ACM International on Conference on Information and Knowledge Management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2983323.2983716","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 12

Abstract

Crowdsourcing has long established itself as a viable alternative to corpus annotation by domain experts for tasks such as document relevance assessment. The crowdsourcing process traditionally relies on high degrees of label redundancy in order to mitigate the detrimental effects of individually noisy worker submissions. Such redundancy comes at the cost of increased label volume, and, subsequently, monetary requirements. In practice, especially as the size of datasets increases, this is undesirable. In this paper, we focus on an alternate method that exploits document information instead, to infer relevance labels for unjudged documents. We present an active learning scheme for document selection that aims at maximising the overall relevance label prediction accuracy, for a given budget of available relevance judgements by exploiting system-wide estimates of label variance and mutual information. Our experiments are based on TREC 2011 Crowdsourcing Track data and show that our method is able to achieve state-of-the-art performance while requiring 17% - 25% less budget.

查看原文本刊更多论文

基于活动内容的众包任务选择

长期以来，众包已经成为领域专家在文档相关性评估等任务中进行语料库注释的可行替代方案。传统的众包过程依赖于高度的标签冗余，以减轻个别嘈杂的工人提交的有害影响。这种冗余的代价是增加标签容量，并随后增加货币需求。在实践中，特别是当数据集的大小增加时，这是不可取的。在本文中，我们关注的是一种利用文档信息来推断未判断文档的相关标签的替代方法。我们提出了一种用于文档选择的主动学习方案，旨在通过利用系统范围内的标签方差和互信息估计，在给定的可用相关性判断预算下，最大化总体相关标签预测准确性。我们的实验基于TREC 2011众包跟踪数据，并表明我们的方法能够达到最先进的性能，同时需要减少17% - 25%的预算。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Proceedings of the 25th ACM International on Conference on Information and Knowledge Management

自引率

0.00%

发文量