Weak Supervision for Scientific Document Relevance Tagging

2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL); Pub Date: 2021-09-01; DOI: 10.1109/JCDL52503.2021.00060

Drahomira Herrmannova, Chathika Gunaratne, V. Walker, Andrew A. Rooney, Robert M. Patton, Mary Wolfe, Charles Schmitt


Abstract

Developing training data for predicting the relevance of research articles to scientific concepts is a resource-intensive process, and existing datasets are available only for a limited set of subject domains. In this work, we investigate whether weakly supervised data generation can be used to develop relevance models. We generate (document, query, label) triples in an automated manner and use them as the training set for a classification model. Published documents were sampled from an open access repository, and the concepts appearing in these documents were used as queries. The location at which each query concept occurs within a document determines the relevance label. We find that a classification model trained on this synthetic data learns to tag documents according to their relevance to a query surprisingly well, providing an 11% F-score improvement over a model trained on ground-truth data.
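The labeling idea described in the abstract can be illustrated with a minimal sketch. The abstract does not say which document locations map to which labels, so everything specific here is an assumption made for illustration: the `Document` fields, the `weak_label` and `make_triples` helpers, the substring matching, and the rule that a title match means "high", an abstract match "medium", and a body-only match "low".

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Document:
    """Hypothetical document record; the field names are assumptions."""
    doc_id: str
    title: str
    abstract: str
    body: str


def weak_label(doc: Document, concept: str) -> Optional[str]:
    """Assign a relevance label from where the concept first occurs.

    Assumed rule (not taken from the paper): title -> "high",
    abstract -> "medium", body -> "low", absent -> no triple.
    """
    needle = concept.lower()
    if needle in doc.title.lower():
        return "high"
    if needle in doc.abstract.lower():
        return "medium"
    if needle in doc.body.lower():
        return "low"
    return None


def make_triples(docs: List[Document],
                 concepts: List[str]) -> List[Tuple[str, str, str]]:
    """Build (document id, query concept, label) triples without manual annotation."""
    triples = []
    for doc in docs:
        for concept in concepts:
            label = weak_label(doc, concept)
            if label is not None:
                triples.append((doc.doc_id, concept, label))
    return triples


# Toy usage with made-up text.
example_docs = [
    Document(
        doc_id="d1",
        title="Weak supervision for tagging",
        abstract="We study relevance models.",
        body="Our classifier uses transformers.",
    )
]
example_concepts = ["weak supervision", "relevance", "transformers", "genomics"]
example_triples = make_triples(example_docs, example_concepts)
```

Triples produced this way could then serve as training examples for any text classifier. A realistic pipeline would presumably use concept extraction rather than plain substring matching.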
