Information Extraction from Arabic and Latin scanned invoices

2018 IEEE 2nd International Workshop on Arabic and Derived Script Analysis and Recognition (ASAR). Pub Date: 2018-03-12. DOI: 10.1109/ASAR.2018.8480221

Najoua Rahal, Maroua Tounsi, M. B. Jlaiel, A. Alimi

Citations: 7

Abstract

Extracting relevant entities from scanned document images is a challenging task because templates are highly heterogeneous and structural layouts vary widely. These problems reduce the accuracy of OCR on document images. In this paper, we propose an effective solution to these problems that extracts the relevant entities from Arabic and Latin scanned invoices. The input to the system is an invoice image, which is submitted to an OCR engine without layout analysis. The text recognized by the OCR is then labeled. By combining the logical and physical structures, a local graph model is built for entity extraction. Finally, we implement a correction module that fixes mislabeling by eliminating the superfluous parts detected in the labeling step. We evaluate the approach on 1050 real invoices, as reported in the experimental section.
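The pipeline described in the abstract (labeling, a local graph linking labels to nearby text, then correction) can be illustrated with a minimal sketch. This is not the paper's implementation: the `Word` type, the `KEYWORDS` patterns, the `line_tol` threshold, the nearest-neighbour linking rule, and the sample data are all assumptions made for illustration, and the OCR output is assumed to arrive as text segments with bounding-box coordinates.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR text segment with the top-left corner of its bounding box (assumed format)."""
    text: str
    x: float
    y: float


# Hypothetical keyword patterns; the paper's actual label set is not given in the abstract.
KEYWORDS = {
    "invoice_number": re.compile(r"^(invoice\s*(no|number)\.?:?|رقم الفاتورة:?)$", re.I),
    "total": re.compile(r"^(total:?|المجموع:?)$", re.I),
}


def label(words):
    """Labeling step: tag OCR segments that match a keyword pattern."""
    return [(name, w) for w in words for name, rx in KEYWORDS.items() if rx.match(w.text)]


def nearest_neighbour(anchor, words, line_tol=5.0):
    """Local graph edge: link a keyword to the closest segment on its line, else the closest below."""
    same_line = [w for w in words if w is not anchor and abs(w.y - anchor.y) <= line_tol]
    if same_line:
        # Absolute horizontal distance, so a value left of the keyword (Arabic) also qualifies.
        return min(same_line, key=lambda w: abs(w.x - anchor.x))
    below = [w for w in words if w.y > anchor.y + line_tol]
    if below:
        return min(below, key=lambda w: (w.y - anchor.y, abs(w.x - anchor.x)))
    return None


def correct(value):
    """Correction step (simplified): strip superfluous punctuation around the value."""
    return value.strip(" :;,.-")


def extract(words):
    """Run labeling, neighbour linking, and correction; return {entity name: value}."""
    entities = {}
    for name, keyword in label(words):
        neighbour = nearest_neighbour(keyword, words)
        if neighbour is not None:
            entities[name] = correct(neighbour.text)
    return entities


# Made-up sample segments standing in for OCR output.
words = [
    Word("Invoice No:", 10, 10), Word(": INV-0042", 120, 11),
    Word("Total:", 10, 200), Word("1,250.00 ;", 120, 201),
]
result = extract(words)
```

Linking on absolute distance rather than "the word to the right" is a deliberate simplification so the same rule covers left-to-right and right-to-left layouts; a fuller graph would keep several candidate edges per keyword and score them.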

