Drahomira Herrmannova, Chathika Gunaratne, V. Walker, Andrew A. Rooney, Robert M. Patton, Mary Wolfe, Charles Schmitt
{"title":"Weak Supervision for Scientific Document Relevance Tagging","authors":"Drahomira Herrmannova, Chathika Gunaratne, V. Walker, Andrew A. Rooney, Robert M. Patton, Mary Wolfe, Charles Schmitt","doi":"10.1109/JCDL52503.2021.00060","DOIUrl":null,"url":null,"abstract":"Developing training data for predicting the relevance of research articles to scientific concepts is a resource-intensive process, and existing datasets are only available for limited subject domains. In this work, we investigate the possibility of weakly supervised data generation for developing relevance models. We approach this by generating document, query, and label triples in an automated manner and by using this data to create a training set for a classification model. Published documents were sampled from an open access repository, and the concepts appearing in these documents were used as queries. We use the location of occurrence of each query concept within a document to determine the relevance label. We find that a classification model trained on this synthetic data can learn to tag documents according to their relevance to a query surprisingly well, providing an 11% f-score improvement over a model trained on ground truth data.","PeriodicalId":112400,"journal":{"name":"2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/JCDL52503.2021.00060","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Developing training data for predicting the relevance of research articles to scientific concepts is a resource-intensive process, and existing datasets are only available for limited subject domains. In this work, we investigate the possibility of weakly supervised data generation for developing relevance models. We approach this by generating document, query, and label triples in an automated manner and by using this data to create a training set for a classification model. Published documents were sampled from an open access repository, and the concepts appearing in these documents were used as queries. We use the location of occurrence of each query concept within a document to determine the relevance label. We find that a classification model trained on this synthetic data can learn to tag documents according to their relevance to a query surprisingly well, providing an 11% f-score improvement over a model trained on ground truth data.