A Probabilistically Integrated System for Crowd-Assisted Text Labeling and Extraction

Journal of Data and Information Quality (JDIQ), Pub Date: 2017-02-09, DOI: 10.1145/3012003

S. Goldberg, D. Wang, Christan Earl Grant

Citations: 12

Abstract

The amount of text data has been growing exponentially in recent years, giving rise to automatic information extraction methods that store text annotations in a database. Current state-of-the-art structured prediction methods, however, are likely to produce erroneous annotations, so it is important to be able to manage the overall uncertainty of the database. At the same time, the advent of crowdsourcing has enabled humans to aid machine algorithms at scale. In this article, we introduce pi-CASTLE, a system that optimizes and integrates human and machine computing as applied to a complex structured prediction problem involving Conditional Random Fields (CRFs). We propose strategies grounded in information theory to select a token subset, formulate questions for the crowd to label, and integrate these labelings back into the database using a method of constrained inference. On both a text segmentation task over academic citations and a named entity recognition task over tweets, we show an order of magnitude improvement in accuracy gain over baseline methods.
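The select-then-constrain loop described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact selection criteria or inference procedure, so everything here is an assumption: token selection uses per-token marginal entropy (one common information-theoretic criterion), and crowd answers are integrated by a Viterbi decoder that pins the answered positions to their given labels. The function names (`token_entropy`, `select_tokens`, `constrained_viterbi`) and the toy scores are hypothetical and are not taken from pi-CASTLE.

```python
import math

NEG_INF = float("-inf")


def token_entropy(marginal):
    """Shannon entropy (bits) of one token's label distribution."""
    return -sum(p * math.log2(p) for p in marginal if p > 0)


def select_tokens(marginals, k):
    """Indices of the k tokens whose label marginals are most uncertain.

    Assumption: highest-entropy-first is used as the selection rule.
    """
    ranked = sorted(range(len(marginals)),
                    key=lambda i: token_entropy(marginals[i]),
                    reverse=True)
    return ranked[:k]


def constrained_viterbi(emission, transition, constraints):
    """Best label sequence under log-space scores.

    emission[t][y]   : log score of label y at position t
    transition[p][y] : log score of moving from label p to label y
    constraints      : {position: label} pairs (e.g. crowd answers) that
                       the decoded sequence is forced to respect
    """
    n, m = len(emission), len(transition)

    def allowed(t, y):
        return t not in constraints or constraints[t] == y

    score = [[NEG_INF] * m for _ in range(n)]
    back = [[0] * m for _ in range(n)]
    for y in range(m):
        if allowed(0, y):
            score[0][y] = emission[0][y]
    for t in range(1, n):
        for y in range(m):
            if not allowed(t, y):
                continue  # labels ruled out by a constraint stay at -inf
            prev = max(range(m), key=lambda p: score[t - 1][p] + transition[p][y])
            score[t][y] = score[t - 1][prev] + transition[prev][y] + emission[t][y]
            back[t][y] = prev
    # Follow back-pointers from the best final label.
    y = max(range(m), key=lambda label: score[n - 1][label])
    path = [y]
    for t in range(n - 1, 0, -1):
        y = back[t][y]
        path.append(y)
    return path[::-1]


# Toy example: 3 tokens, 2 labels, made-up scores.
marginals = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]
emission = [[0.0, -1.0], [-0.5, -0.6], [-1.0, 0.0]]
transition = [[0.0, -2.0], [-2.0, 0.0]]

ask = select_tokens(marginals, 1)                        # token to send to the crowd
before = constrained_viterbi(emission, transition, {})   # no crowd input
after = constrained_viterbi(emission, transition, {2: 1})  # crowd says token 2 has label 1
```

In this toy case a single pinned label on the last token changes the decoded labels of the whole sequence, because the transition scores favor staying in the same label. That propagation is the reason for integrating answers through constrained inference rather than overwriting individual tokens.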


