An OpenCV-based Framework for Table Information Extraction
Authors: Jiayi Yuan, Hongye Li, Meng Wang, Ruyang Liu, Chuanyou Li, Beilun Wang
Published in: 2020 IEEE International Conference on Knowledge Graph (ICKG), 2020-08-01
DOI: https://doi.org/10.1109/ICBK50248.2020.00093
Citations: 2
Abstract
Portable Document Format (PDF), one of the most popular file formats, is especially useful for educational documents such as textbooks, articles, and papers, because it preserves the original graphic appearance and makes documents easy to share online. Detecting and extracting information from tables in PDF files can provide a wealth of structured data for constructing educational knowledge graphs. However, most existing methods rely on PDF parsing tools and natural language processing techniques, which generally require training samples and are fragile when handling cross-page tables. In light of this, we propose a novel OpenCV-based framework to extract the metadata and specific values from PDF tables. Specifically, we first highlight the visual outline of the tables. Then, we locate tables using horizontal and vertical lines and obtain the coordinates of the tabular frames on each PDF page. Once the tables are detected, for each table we detect cross-page scenarios and use an Optical Character Recognition (OCR) engine to extract the specific values in each table cell. Unlike machine learning based methods, the proposed method achieves accurate table information extraction without labeled data. We conduct extensive experiments on real-world PDF files. The results demonstrate that our approach can effectively deal with cross-page tables and needs only 6.12 seconds on average to process a table.
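The line-based localization step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it is a dependency-free stand-in that assumes the page has already been rendered and binarized into a grid of 0/1 pixels. The names `long_runs`, `transpose`, `table_frame`, and the `min_len` threshold are hypothetical, chosen only for illustration. The idea is that ruling lines form long unbroken runs of ink, while text strokes form short ones, so keeping only long horizontal and vertical runs isolates the table frame.

```python
from typing import List, Optional, Tuple

Grid = List[List[int]]  # 1 = ink pixel, 0 = background (assumed binarized page)


def long_runs(grid: Grid, min_len: int) -> Grid:
    """Keep only pixels lying in a horizontal run of at least min_len ink pixels."""
    out = [[0] * len(row) for row in grid]
    for y, row in enumerate(grid):
        x = 0
        while x < len(row):
            if not row[x]:
                x += 1
                continue
            start = x
            while x < len(row) and row[x]:
                x += 1
            if x - start >= min_len:  # long enough to be a ruling line
                for i in range(start, x):
                    out[y][i] = 1
    return out


def transpose(grid: Grid) -> Grid:
    """Swap rows and columns so vertical runs can reuse long_runs."""
    return [list(col) for col in zip(*grid)]


def table_frame(grid: Grid, min_len: int) -> Optional[Tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) of the ruled region, or None.

    Hypothetical rule: a frame is reported only if both horizontal and
    vertical ruling lines are present on the page.
    """
    horizontal = long_runs(grid, min_len)
    vertical = transpose(long_runs(transpose(grid), min_len))
    if not any(map(any, horizontal)) or not any(map(any, vertical)):
        return None
    xs, ys = [], []
    for y, row in enumerate(grid):
        for x in range(len(row)):
            if horizontal[y][x] or vertical[y][x]:
                xs.append(x)
                ys.append(y)
    return (min(xs), min(ys), max(xs), max(ys))
```

With OpenCV itself, a comparable filter is commonly built from a morphological opening (`cv2.morphologyEx` with `cv2.MORPH_OPEN`) using long, thin kernels from `cv2.getStructuringElement`. Whether the paper does exactly this is an assumption, since the abstract does not say. The sketch also treats the whole page as containing at most one table; separating several tables would need a connected-component step on the line mask.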