定位和解析HTML医学文章中的参考书目。

IF 1.8 4区计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal on Document Analysis and Recognition Pub Date : 2010-06-01 DOI:10.1007/s10032-009-0105-9

Jie Zou, Daniel Le, George R Thoma

{"title":"定位和解析HTML医学文章中的参考书目。","authors":"Jie Zou, Daniel Le, George R Thoma","doi":"10.1007/s10032-009-0105-9","DOIUrl":null,"url":null,"abstract":"The set of references that typically appear toward the end of journal articles is sometimes, though not always, a field in bibliographic (citation) databases. But even if references do not constitute such a field, they can be useful as a preprocessing step in the automated extraction of other bibliographic data from articles, as well as in computer-assisted indexing of articles. Automation in data extraction and indexing to minimize human labor is key to the affordable creation and maintenance of large bibliographic databases. Extracting the components of references, such as author names, article title, journal name, publication date and other entities, is therefore a valuable and sometimes necessary task. This paper describes a two-step process using statistical machine learning algorithms, to first locate the references in HTML medical articles and then to parse them. Reference locating identifies the reference section in an article and then decomposes it into individual references. We formulate this step as a two-class classification problem based on text and geometric features. An evaluation conducted on 500 articles drawn from 100 medical journals achieves near-perfect precision and recall rates for locating references. Reference parsing identifies the components of each reference. For this second step, we implement and compare two algorithms. One relies on sequence statistics and trains a Conditional Random Field. The other focuses on local feature statistics and trains a Support Vector Machine to classify each individual word, followed by a search algorithm that systematically corrects low confidence labels if the label sequence violates a set of predefined rules. The overall performance of these two reference-parsing algorithms is about the same: above 99% accuracy at the word level, and over 97% accuracy at the chunk level.","PeriodicalId":50277,"journal":{"name":"International Journal on Document Analysis and Recognition","volume":null,"pages":null},"PeriodicalIF":1.8000,"publicationDate":"2010-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1007/s10032-009-0105-9","citationCount":"28","resultStr":"{\"title\":\"Locating and parsing bibliographic references in HTML medical articles.\",\"authors\":\"Jie Zou, Daniel Le, George R Thoma\",\"doi\":\"10.1007/s10032-009-0105-9\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The set of references that typically appear toward the end of journal articles is sometimes, though not always, a field in bibliographic (citation) databases. But even if references do not constitute such a field, they can be useful as a preprocessing step in the automated extraction of other bibliographic data from articles, as well as in computer-assisted indexing of articles. Automation in data extraction and indexing to minimize human labor is key to the affordable creation and maintenance of large bibliographic databases. Extracting the components of references, such as author names, article title, journal name, publication date and other entities, is therefore a valuable and sometimes necessary task. This paper describes a two-step process using statistical machine learning algorithms, to first locate the references in HTML medical articles and then to parse them. Reference locating identifies the reference section in an article and then decomposes it into individual references. We formulate this step as a two-class classification problem based on text and geometric features. An evaluation conducted on 500 articles drawn from 100 medical journals achieves near-perfect precision and recall rates for locating references. Reference parsing identifies the components of each reference. For this second step, we implement and compare two algorithms. One relies on sequence statistics and trains a Conditional Random Field. The other focuses on local feature statistics and trains a Support Vector Machine to classify each individual word, followed by a search algorithm that systematically corrects low confidence labels if the label sequence violates a set of predefined rules. The overall performance of these two reference-parsing algorithms is about the same: above 99% accuracy at the word level, and over 97% accuracy at the chunk level.\",\"PeriodicalId\":50277,\"journal\":{\"name\":\"International Journal on Document Analysis and Recognition\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":1.8000,\"publicationDate\":\"2010-06-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://sci-hub-pdf.com/10.1007/s10032-009-0105-9\",\"citationCount\":\"28\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Journal on Document Analysis and Recognition\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1007/s10032-009-0105-9\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal on Document Analysis and Recognition","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s10032-009-0105-9","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 28

摘要

通常出现在期刊文章末尾的一组参考文献有时(虽然不总是)是书目(引文)数据库中的一个字段。但是，即使参考文献不构成这样一个领域，它们也可以作为从文章中自动提取其他书目数据的预处理步骤，以及在计算机辅助的文章索引中发挥作用。数据提取和索引的自动化以减少人力劳动是创建和维护大型书目数据库的关键。因此，提取参考文献的组成部分，如作者姓名、文章标题、期刊名称、出版日期和其他实体，是一项有价值的、有时是必要的任务。本文描述了一个使用统计机器学习算法的两步过程，首先定位HTML医学文章中的参考文献，然后对其进行解析。参考文献定位识别文章中的参考文献部分，然后将其分解为单独的参考文献。我们将这一步表述为基于文本和几何特征的两类分类问题。对取自100个医学期刊的500篇文章进行的评估在定位参考文献方面达到了近乎完美的精确度和召回率。引用解析识别每个引用的组件。对于第二步，我们实现并比较两种算法。一个依赖于序列统计并训练一个条件随机场。另一种方法侧重于局部特征统计，并训练一个支持向量机对每个单独的词进行分类，然后是一个搜索算法，如果标签序列违反了一组预定义的规则，则系统地纠正低置信度标签。这两种引用解析算法的总体性能大致相同:在单词级别上准确率超过99%，在块级别上准确率超过97%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

Locating and parsing bibliographic references in HTML medical articles.

查看原文本刊更多论文

Locating and parsing bibliographic references in HTML medical articles.

The set of references that typically appear toward the end of journal articles is sometimes, though not always, a field in bibliographic (citation) databases. But even if references do not constitute such a field, they can be useful as a preprocessing step in the automated extraction of other bibliographic data from articles, as well as in computer-assisted indexing of articles. Automation in data extraction and indexing to minimize human labor is key to the affordable creation and maintenance of large bibliographic databases. Extracting the components of references, such as author names, article title, journal name, publication date and other entities, is therefore a valuable and sometimes necessary task. This paper describes a two-step process using statistical machine learning algorithms, to first locate the references in HTML medical articles and then to parse them. Reference locating identifies the reference section in an article and then decomposes it into individual references. We formulate this step as a two-class classification problem based on text and geometric features. An evaluation conducted on 500 articles drawn from 100 medical journals achieves near-perfect precision and recall rates for locating references. Reference parsing identifies the components of each reference. For this second step, we implement and compare two algorithms. One relies on sequence statistics and trains a Conditional Random Field. The other focuses on local feature statistics and trains a Support Vector Machine to classify each individual word, followed by a search algorithm that systematically corrects low confidence labels if the label sequence violates a set of predefined rules. The overall performance of these two reference-parsing algorithms is about the same: above 99% accuracy at the word level, and over 97% accuracy at the chunk level.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

International Journal on Document Analysis and Recognition 工程技术-计算机：人工智能

CiteScore

6.20

自引率

4.30%

发文量

审稿时长

7.5 months

期刊介绍： The large number of existing documents and the production of a multitude of new ones every year raise important issues in efficient handling, retrieval and storage of these documents and the information which they contain. This has led to the emergence of new research domains dealing with the recognition by computers of the constituent elements of documents - including characters, symbols, text, lines, graphics, images, handwriting, signatures, etc. In addition, these new domains deal with automatic analyses of the overall physical and logical structures of documents, with the ultimate objective of a high-level understanding of their semantic content. We have also seen renewed interest in optical character recognition (OCR) and handwriting recognition during the last decade. Document analysis and recognition are obviously the next stage. Automatic, intelligent processing of documents is at the intersections of many fields of research, especially of computer vision, image analysis, pattern recognition and artificial intelligence, as well as studies on reading, handwriting and linguistics. Although quality document related publications continue to appear in journals dedicated to these domains, the community will benefit from having this journal as a focal point for archival literature dedicated to document analysis and recognition.