Carlos Alejandro Aguirre, Sneha Gullapalli, María F. De la Torre, Alice Lam, J. Weese, W. Hsu
Learning to Filter Documents for Information Extraction Using Rapid Annotation
2017 International Conference on Machine Learning and Data Science (MLDS), published 2017-12-01
DOI: 10.1109/MLDS.2017.24 (https://doi.org/10.1109/MLDS.2017.24)
Citations: 2
Abstract
Corpus-driven approaches to information extraction from documents face problems of relevance determination, namely determining which documents are of requisite type, structure, and content for a specified query and context. In this paper, we discuss the problem of learning to filter documents crawled from the web with respect to such relevance criteria, and in particular how to annotate document corpora for supervised classification learning approaches to this problem. For context, we describe a system aimed at extracting experimental data from scientific publications, with the long-term goal of extracting procedural information from relevant sections on experimental methodology. We consider motivating use cases for our learning filter, using the documents passed by the filter: marking up sections (or passages); capturing entities and relationships; and explaining to a domain expert why a document is relevant. These distinct use cases make the annotation task multi-faceted. Our approach focuses on speeding up annotation in learning to filter while minimizing loss of precision or recall on the learning task, using a reconfigurable user interface. We develop such an interface, report on its use in tandem with classification on a real extraction task, and discuss extensions of this work to visual scene filtering and annotation.