An Automatic System for Extracting Figures and Captions in Biomedical PDF Documents

2011 IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW) Pub Date : 2011-11-12 DOI:10.1109/BIBM.2011.26

L. D. Lopez, Jingyi Yu, C. Arighi, Hongzhan Huang, H. Shatkay, Cathy H. Wu

Pages: 578-581

Citations: 20

Abstract

Figures in biomedical articles often constitute direct evidence of experimental results. Image analysis methods can be coupled with text-based methods to improve knowledge discovery. However, automatically harvesting figures along with their associated captions from full-text articles remains challenging. In this paper, we present an automatic system for robustly harvesting figures from biomedical literature. Our approach relies on the idea that the PDF specification of the document layout can be used to identify encoded figures and figure boundaries within the PDF and to enforce constraints among figure regions. This allows us to harvest fragments of figures (subfigures) from the PDF, correctly identify subfigures that belong to the same figure, and identify the caption associated with each figure. Our method simultaneously recovers figures and captions and applies additional filtering steps to remove irrelevant figures such as logos, to eliminate text passages that were incorrectly identified as captions, and to regroup subfigures into a putative figure. Finally, we associate figures with captions. Our preliminary experiments suggest that our method achieves an accuracy of 95% in harvesting figure-caption pairs from a set of 2,035 full-text biomedical documents from BioCreative III, containing 12,574 figures.
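The stages named in the abstract (regrouping subfigures, filtering out logos, rejecting non-caption text, and pairing figures with captions) can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes that image fragments and text blocks have already been extracted from a page as bounding boxes, that the y coordinate grows downward, and that a caption sits below its figure. The names `Box`, `group_subfigures`, and `harvest`, the caption pattern, and the thresholds `max_gap` and `min_area` are all hypothetical choices made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box; y is assumed to grow downward."""
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self):
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)

    def gap(self, other):
        """Largest per-axis separation between two boxes (0 if they touch or overlap)."""
        dx = max(other.x0 - self.x1, self.x0 - other.x1, 0.0)
        dy = max(other.y0 - self.y1, self.y0 - other.y1, 0.0)
        return max(dx, dy)

    def union(self, other):
        return Box(min(self.x0, other.x0), min(self.y0, other.y0),
                   max(self.x1, other.x1), max(self.y1, other.y1))


# Assumed caption heuristic: text starting with "Fig." or "Figure" plus a number.
CAPTION_RE = re.compile(r"^\s*(Fig\.|Figure)\s*\d+", re.IGNORECASE)


def group_subfigures(fragments, max_gap=15.0):
    """Merge fragments lying within max_gap of each other (union-find)."""
    parent = list(range(len(fragments)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(fragments)):
        for j in range(i + 1, len(fragments)):
            if fragments[i].gap(fragments[j]) <= max_gap:
                parent[find(i)] = find(j)

    merged = {}
    for i, frag in enumerate(fragments):
        root = find(i)
        merged[root] = merged[root].union(frag) if root in merged else frag
    return list(merged.values())


def harvest(fragments, text_blocks, min_area=2500.0, max_gap=15.0):
    """Return (figure_box, caption_text) pairs for one page.

    text_blocks is a list of (Box, str) tuples.
    """
    # Regroup first, then drop small regions such as logos, so that
    # small panels of a larger figure are not discarded prematurely.
    figures = [f for f in group_subfigures(fragments, max_gap)
               if f.area() >= min_area]
    # Keep only text blocks that look like captions.
    captions = [(box, text) for box, text in text_blocks
                if CAPTION_RE.match(text)]
    pairs = []
    for fig in figures:
        # Candidate captions start at or below the bottom edge of the figure.
        below = [(box.y0 - fig.y1, text) for box, text in captions
                 if box.y0 >= fig.y1]
        if below:
            pairs.append((fig, min(below, key=lambda c: c[0])[1]))
    return pairs
```

Calling `harvest(fragments, text_blocks)` with two adjacent panels, a small logo, one caption, and one body paragraph would be expected to yield a single pair: the merged panel region and the caption text. A real system would obtain the boxes from the PDF's layout operators and would need more careful geometry, for example captions placed beside or above a figure and multi-column pages.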
