PDFFigures 2.0: Mining figures from research papers

2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL) | Pub Date: 2016-06-19 | DOI: 10.1145/2910896.2910904

Christopher Clark, S. Divvala


Citations: 116

Abstract

Figures and tables are key sources of information in many scholarly documents. However, current academic search engines do not make use of figures and tables when semantically parsing documents or presenting document summaries to users. To facilitate these applications, we develop an algorithm, called “PDFFigures 2.0,” that extracts figures, tables, and captions from documents. Our proposed approach analyzes the structure of individual pages by detecting captions, graphical elements, and chunks of body text, and then locates figures and tables by reasoning about the empty regions within that text. To evaluate our work, we introduce a new dataset of computer science papers, along with ground truth labels for the locations of the figures, tables, and captions within them. Our algorithm achieves 94% precision at 90% recall on this dataset, surpassing the previous state of the art. Further, we show how our framework was used to extract figures from a corpus of over one million papers, and how the resulting extractions were integrated into the user interface of a smart academic search engine, Semantic Scholar (www.semanticscholar.org). Finally, we present results of exploratory data analysis completed on the extracted figures, as well as an extension of our method to the task of section title extraction. We release our dataset and code on our project webpage to enable future research (http://pdffigures2.allenai.org).
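The idea of locating a figure by reasoning about the empty regions around body text can be illustrated with a small sketch. This is not the PDFFigures 2.0 implementation; it is a simplified illustration that assumes captions, body-text blocks, and graphical elements have already been detected as axis-aligned boxes. The names `Box`, `propose_region`, and `locate_figure`, and the scoring rule (prefer regions that contain graphics, then larger regions), are hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in page coordinates; y grows downward (an assumption)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)

    def overlaps(self, other: "Box") -> bool:
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


def propose_region(caption: Box, body_text: List[Box], page: Box,
                   above: bool) -> Optional[Box]:
    """Grow a region from the caption, upward or downward, until it meets
    the nearest body-text block in the same columns or the page edge."""
    limit = page.y0 if above else page.y1
    for block in body_text:
        shares_columns = block.x0 < caption.x1 and caption.x0 < block.x1
        if not shares_columns:
            continue
        if above and block.y1 <= caption.y0:
            limit = max(limit, block.y1)   # closest text block above
        elif not above and block.y0 >= caption.y1:
            limit = min(limit, block.y0)   # closest text block below
    if above:
        region = Box(caption.x0, limit, caption.x1, caption.y0)
    else:
        region = Box(caption.x0, caption.y1, caption.x1, limit)
    return region if region.area > 0 else None


def locate_figure(caption: Box, body_text: List[Box], graphics: List[Box],
                  page: Box) -> Optional[Box]:
    """Pick the empty region next to the caption that best explains a figure.
    Hypothetical scoring: number of graphical elements, then area."""
    candidates = [r for r in (propose_region(caption, body_text, page, True),
                              propose_region(caption, body_text, page, False))
                  if r is not None]

    def score(region: Box):
        return (sum(1 for g in graphics if region.overlaps(g)), region.area)

    return max(candidates, key=score, default=None)


# Toy page: text above and below, one graphic sitting just over the caption.
page = Box(0, 0, 100, 200)
caption = Box(10, 100, 90, 110)
body_text = [Box(10, 10, 90, 40), Box(10, 115, 90, 190)]
graphics = [Box(20, 50, 80, 95)]
result = locate_figure(caption, body_text, graphics, page)
print(result)
```

In this toy layout the region above the caption wins because it contains the only graphical element, while the gap below the caption is a thin strip with no graphics. A real extractor would also need to handle multi-column layouts, captions beside figures, and figures made entirely of text, none of which this sketch attempts.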

