Visual Information Extraction in the Wild: Practical Dataset and End-to-end Solution

IEEE International Conference on Document Analysis and Recognition | Pub Date: 2023-05-12 | DOI: 10.48550/arXiv.2305.07498

Jianfeng Kuang, W. Hua, Dingkang Liang, Mingkun Yang, Deqiang Jiang, Bo Ren, Yu Zhou, Xiang Bai


Citations: 6

Abstract

Visual information extraction (VIE), which aims to perform OCR and information extraction simultaneously in a unified framework, has drawn increasing attention due to its essential role in applications such as understanding receipts, goods, and traffic signs. However, existing benchmark datasets for VIE consist mainly of document images that lack adequate diversity in layout structures, background disturbances, and entity categories, so they cannot fully reveal the challenges of real-world applications. In this paper, we propose a large-scale dataset of camera images for VIE, which contains not only a larger variance of layouts, backgrounds, and fonts but also many more entity types. In addition, we propose a novel framework for end-to-end VIE that combines the stages of OCR and information extraction in an end-to-end learning fashion. Unlike previous end-to-end approaches that directly adopt OCR features as the input of an information extraction module, we propose to use contrastive learning to narrow the semantic gap caused by the difference between the tasks of OCR and information extraction. We evaluate existing end-to-end methods for VIE on the proposed dataset and observe that their performance drops noticeably from SROIE (a widely used English dataset) to our dataset due to the larger variance of layouts and entities. These results demonstrate that our dataset is more practical for promoting advanced VIE algorithms. Experiments also demonstrate that the proposed VIE method consistently achieves clear performance gains on both the proposed dataset and SROIE.
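
The abstract does not say which contrastive objective is used, so the following is only a generic illustration of the idea of pulling paired OCR and information-extraction features together. It is a minimal sketch of an InfoNCE-style loss in plain Python, not the authors' implementation. The names `info_nce`, `ocr_feats`, `ie_feats`, and `temperature` are hypothetical, and the pairing of row i in one list with row i in the other is an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def info_nce(ocr_feats, ie_feats, temperature=0.1):
    """Generic InfoNCE-style loss (an assumption, not the paper's loss).

    ocr_feats[i] and ie_feats[i] are treated as a positive pair;
    every ie_feats[j] with j != i acts as a negative for ocr_feats[i].
    """
    n = len(ocr_feats)
    total = 0.0
    for i in range(n):
        # Scaled similarities between one OCR feature and all IE features.
        logits = [cosine(ocr_feats[i], ie_feats[j]) / temperature
                  for j in range(n)]
        # Numerically stable log-sum-exp over the row.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the positive pair.
        total += log_denom - logits[i]
    return total / n


# Toy features: matching pairs point in the same direction.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
# Same features with the pairing swapped, so positives look dissimilar.
mismatched = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the loss is small when each OCR feature is most similar to its own paired feature and large when the pairing is wrong, which is the sense in which a contrastive term can reduce a gap between two feature spaces.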


