Identifying and Extracting Hierarchical Information from Business PDF Documents

Rohit Shere, Pavan Kumar Chittimalli, Ravindra Naik
15th Innovations in Software Engineering Conference, published 2022-02-24
DOI: 10.1145/3511430.3511440 (https://doi.org/10.1145/3511430.3511440)

Abstract

Portable Document Format (PDF) is a popular choice for secure communication and persistence of business information, and is a universally accepted format among businesses choosing to go digital. PDF provides multiple ways to make information visually appealing and readable, and supports device-independent rendering. To achieve this, PDF stores metadata with individual text characters, graphic components, and other layout elements. Such atomic, component-wise metadata makes machine processing of information in PDF very challenging; the challenge is compounded by the difficulty of stitching the original semantics back together from the componentized information. We propose a generic approach that extracts the hierarchy of the document structure while separating the content from headers and footers, and that extracts the metadata associated with checkboxes, in order to annotate the business information contained in a PDF for tasks such as mining specifications and rules from the document. Our prototype processes large, real-life PDF documents of roughly 400 pages each, with nearly 95% of the extraction requiring no human intervention.
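The abstract does not say how content is separated from headers and footers. As a minimal illustrative sketch, and not the authors' method, one common heuristic treats a line as a header or footer if it recurs at the top or bottom of most pages once digits (such as page numbers) are masked. The function names, the `depth` and `min_ratio` parameters, and the sample pages below are all hypothetical; the sketch also assumes text lines have already been extracted per page by some PDF library.

```python
import re
from collections import Counter

def normalize(line):
    """Mask digits and collapse whitespace so 'Page 1 of 3' matches 'Page 2 of 3'."""
    return re.sub(r"\d+", "#", " ".join(line.split())).lower()

def find_repeated_margins(pages, depth=2, min_ratio=0.6):
    """Return normalized lines that recur in the top/bottom `depth` lines of most pages.

    `pages` is a list of pages, each a list of text lines (assumed input shape).
    """
    counts = Counter()
    for lines in pages:
        # Short pages are treated as margin in their entirety.
        margin = lines[:depth] + lines[-depth:] if len(lines) > 2 * depth else lines
        # Count each normalized line at most once per page.
        counts.update({normalize(l) for l in margin if l.strip()})
    threshold = min_ratio * len(pages)
    return {text for text, n in counts.items() if n >= threshold}

def strip_margins(pages, depth=2, min_ratio=0.6):
    """Drop margin lines that were identified as repeated headers or footers."""
    repeated = find_repeated_margins(pages, depth, min_ratio)
    cleaned = []
    for lines in pages:
        body = []
        for i, line in enumerate(lines):
            in_margin = i < depth or i >= len(lines) - depth
            if in_margin and normalize(line) in repeated:
                continue  # repeated header/footer line: skip it
            body.append(line)
        cleaned.append(body)
    return cleaned

# Hypothetical three-page document, already split into text lines per page.
pages = [
    ["ACME Policy Manual", "1 Eligibility", "Applicants must be residents.", "Page 1 of 3"],
    ["ACME Policy Manual", "1.1 Age limits", "Applicants must be over 18.", "Page 2 of 3"],
    ["ACME Policy Manual", "2 Claims", "Claims are filed online.", "Page 3 of 3"],
]
cleaned = strip_margins(pages, depth=1)
```

A frequency threshold like this is only a heuristic: it can miss headers that change per chapter and can wrongly drop body lines that happen to repeat at page boundaries, so a real pipeline would likely also use the position and font metadata that PDF stores with each text element.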