PDF- trex:一种从PDF文档中识别和提取表格的方法

Ermelinda Oro, M. Ruffolo
{"title":"PDF- trex:一种从PDF文档中识别和提取表格的方法","authors":"Ermelinda Oro, M. Ruffolo","doi":"10.1109/ICDAR.2009.12","DOIUrl":null,"url":null,"abstract":"This paper presents PDF-TREX, an heuristic approach for table recognition and extraction from PDF documents.The heuristics starts from an initial set of basic content elements and aligns and groups them, in bottom-up way by considering only their spatial features, in order to identify tabular arrangements of information. The scope of the approach is to recognize tables contained in PDF documents as a 2-dimensional grid on a Cartesian plane and extract them as a set of cells equipped by 2-dimensional coordinates. Experiments, carried out on a dataset composed of tables contained in documents coming from different domains, shows that the approach is well performing in recognizing table cells.The approach aims at improving PDF document annotation and information extraction by providing an output that can be further processed for understanding table and document contents.","PeriodicalId":433762,"journal":{"name":"2009 10th International Conference on Document Analysis and Recognition","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"94","resultStr":"{\"title\":\"PDF-TREX: An Approach for Recognizing and Extracting Tables from PDF Documents\",\"authors\":\"Ermelinda Oro, M. Ruffolo\",\"doi\":\"10.1109/ICDAR.2009.12\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper presents PDF-TREX, an heuristic approach for table recognition and extraction from PDF documents.The heuristics starts from an initial set of basic content elements and aligns and groups them, in bottom-up way by considering only their spatial features, in order to identify tabular arrangements of information. The scope of the approach is to recognize tables contained in PDF documents as a 2-dimensional grid on a Cartesian plane and extract them as a set of cells equipped by 2-dimensional coordinates. Experiments, carried out on a dataset composed of tables contained in documents coming from different domains, shows that the approach is well performing in recognizing table cells.The approach aims at improving PDF document annotation and information extraction by providing an output that can be further processed for understanding table and document contents.\",\"PeriodicalId\":433762,\"journal\":{\"name\":\"2009 10th International Conference on Document Analysis and Recognition\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2009-07-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"94\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2009 10th International Conference on Document Analysis and Recognition\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICDAR.2009.12\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2009 10th International Conference on Document Analysis and Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDAR.2009.12","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 94

摘要

本文提出了PDF- trex,一种用于PDF文档表识别和提取的启发式方法。启发式从一组初始的基本内容元素开始,通过只考虑它们的空间特征,以自下而上的方式对它们进行对齐和分组,以确定信息的表格排列。该方法的范围是将PDF文档中包含的表识别为笛卡尔平面上的二维网格,并将它们提取为一组由二维坐标配备的单元格。实验结果表明,该方法能够很好地识别表单元。该方法旨在通过提供可以进一步处理以理解表和文档内容的输出来改进PDF文档注释和信息提取。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
PDF-TREX: An Approach for Recognizing and Extracting Tables from PDF Documents
This paper presents PDF-TREX, an heuristic approach for table recognition and extraction from PDF documents.The heuristics starts from an initial set of basic content elements and aligns and groups them, in bottom-up way by considering only their spatial features, in order to identify tabular arrangements of information. The scope of the approach is to recognize tables contained in PDF documents as a 2-dimensional grid on a Cartesian plane and extract them as a set of cells equipped by 2-dimensional coordinates. Experiments, carried out on a dataset composed of tables contained in documents coming from different domains, shows that the approach is well performing in recognizing table cells.The approach aims at improving PDF document annotation and information extraction by providing an output that can be further processed for understanding table and document contents.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信