A Position-Based Method for the Extraction of Financial Information in PDF Documents

Proceedings of the 21st Australasian Document Computing Symposium. Publication date: 2016-12-05. DOI: 10.1145/3015022.3015024

Benoit Potvin, Roger Villemaire, N. Le


Citations: 1

Abstract

Financial documents are omnipresent, and extracting, validating and exporting their content requires extensive human effort. Because such data is central to effective business decisions, accuracy matters more than any gain in speed or resource savings. Although many methods have been proposed in the literature, automatically extracting reliable financial data remains difficult to solve in practice and even more challenging to deploy in a real-life context. The difficulty stems from the specific nature of financial text, where the relevant information is contained principally in tables of varying formats. Table Extraction (TE), which identifies and decomposes table components, is considered an essential but difficult step in restructuring data into a usable format. In this paper, we present a novel method for extracting financial information by means of two simple heuristics. Our approach is based on the idea that the position of information in unstructured but visually rich documents, as is the case for the Portable Document Format (PDF), is an indicator of semantic relatedness. This solution was developed in partnership with the Caisse de dépôt et placement du Québec. We present our method and its evaluation on a corpus of 600 financial documents, on which it reaches an F-measure of 91%.
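
To make the idea that position can indicate semantic relatedness more concrete, the following is a minimal Python sketch. It is not the authors' method: the abstract does not specify the two heuristics, so the rule used here (pair each label with the nearest numeric fragment to its right on the same row) is only an assumed illustration. The names `Fragment`, `pair_labels_with_values` and `y_tol`, as well as the sample coordinates, are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of text with the page coordinates of its left edge (hypothetical type)."""
    text: str
    x: float
    y: float


# Matches simple financial figures such as 1,250,000 or (830.5) or 12%.
NUMERIC = re.compile(r"^\(?-?[\d,]+(\.\d+)?\)?%?$")


def is_numeric(text):
    return bool(NUMERIC.match(text.strip()))


def pair_labels_with_values(fragments, y_tol=2.0):
    """Assumed heuristic: a label is related to the closest numeric
    fragment that lies to its right on (roughly) the same row."""
    labels = [f for f in fragments if not is_numeric(f.text)]
    values = [f for f in fragments if is_numeric(f.text)]
    pairs = {}
    for label in labels:
        same_row = [
            v for v in values
            if abs(v.y - label.y) <= y_tol and v.x > label.x
        ]
        if same_row:
            nearest = min(same_row, key=lambda v: v.x - label.x)
            pairs[label.text] = nearest.text
    return pairs


# Illustrative fragments, as a PDF text extractor might report them.
fragments = [
    Fragment("Total assets", 50, 700),
    Fragment("1,250,000", 300, 700.5),
    Fragment("Total liabilities", 50, 680),
    Fragment("830,000", 300, 679.8),
    Fragment("Notes", 50, 640),
]
result = pair_labels_with_values(fragments)
```

In this sketch, "Notes" receives no value because no numeric fragment shares its row, which shows how layout alone can separate table rows from surrounding text. A real system would also have to handle multi-column tables, wrapped labels and header rows.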

