Identifying and Extracting Hierarchical Information from Business PDF Documents

15th Innovations in Software Engineering Conference. Pub Date: 2022-02-24. DOI: 10.1145/3511430.3511440

Rohit Shere, Pavan Kumar Chittimalli, Ravindra Naik



Abstract

Portable Document Format (PDF) is a popular choice for secure communication and persistence of business information, and is a universally accepted format among businesses choosing to become digital. PDF provides multiple ways to make information visually appealing and readable, and it supports device-independent rendering. To achieve this, PDF stores metadata with individual text characters, graphic components, and other layout elements. Such atomic, component-wise metadata makes machine processing of information in the PDF format very challenging, and the challenge is compounded by the difficulty of stitching the original semantics back together from the componentized information. We propose a generic approach that extracts the hierarchy of the document structure while separating the content from headers and footers, and that extracts the metadata associated with checkboxes, in order to annotate the business information contained in a PDF for tasks such as mining specifications and rules from the document. Our prototype is able to process real-life, large PDF documents, each running to roughly 400 pages, with nearly 95% of the extraction requiring no human intervention.
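The abstract does not describe the algorithm itself, so the following is only an illustrative sketch of the two ideas it names: separating content from headers and footers, and recovering a section hierarchy. It assumes each page has already been parsed into lines with a font size (for example by a PDF parsing library), treats lines that repeat at the top or bottom of most pages as headers or footers, and treats larger fonts as higher-level headings. The names `Line`, `Node`, `strip_headers_footers`, `build_hierarchy`, `min_ratio`, and `body_size` are hypothetical and are not taken from the paper.

```python
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Line:
    text: str
    font_size: float  # assumption: supplied by an upstream PDF parser


@dataclass
class Node:
    title: str
    level: int
    body: list = field(default_factory=list)
    children: list = field(default_factory=list)


def _normalize(text):
    # Mask digits so "Page 3" and "Page 4" count as the same repeated line.
    return re.sub(r"\d+", "#", text.strip().lower())


def strip_headers_footers(pages, min_ratio=0.8):
    """Drop first/last lines that repeat on at least min_ratio of the pages."""
    counts = Counter()
    for page in pages:
        if page:
            counts.update({_normalize(page[0].text), _normalize(page[-1].text)})
    threshold = min_ratio * len(pages)
    repeated = {key for key, n in counts.items() if n >= threshold}
    cleaned = []
    for page in pages:
        last = len(page) - 1
        cleaned.append([
            line for i, line in enumerate(page)
            if not (i in (0, last) and _normalize(line.text) in repeated)
        ])
    return cleaned


def build_hierarchy(lines, body_size):
    """Nest lines by font size: larger text is treated as a higher-level heading."""
    sizes = sorted({l.font_size for l in lines if l.font_size > body_size},
                   reverse=True)
    level_of = {size: i + 1 for i, size in enumerate(sizes)}
    root = Node("ROOT", 0)
    stack = [root]
    for line in lines:
        level = level_of.get(line.font_size)
        if level is None:
            # Body text belongs to the most recently opened heading.
            stack[-1].body.append(line.text)
            continue
        # Close headings at the same or a deeper level before opening this one.
        while stack[-1].level >= level:
            stack.pop()
        node = Node(line.text, level)
        stack[-1].children.append(node)
        stack.append(node)
    return root


pages = [
    [Line("ACME Policy", 9), Line("1 Scope", 16),
     Line("Applies to all staff.", 10), Line("Page 1", 9)],
    [Line("ACME Policy", 9), Line("1.1 Exceptions", 13),
     Line("None.", 10), Line("Page 2", 9)],
]
cleaned = strip_headers_footers(pages)
flat = [line for page in cleaned for line in page]
root = build_hierarchy(flat, body_size=10)
```

In this toy input, "ACME Policy" and the page-number lines are dropped as repeated edge lines, and "1.1 Exceptions" is nested under "1 Scope" because its font is smaller. Real business documents would likely need further cues (numbering patterns, indentation, bold weight), which this sketch does not attempt.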

