A Graphical Approach to Document Layout Analysis

IEEE International Conference on Document Analysis and Recognition. Pub Date: 2023-08-03. DOI: 10.48550/arXiv.2308.02051

Jilin Wang, Michael Krumdick, Baojia Tong, Hamima Halim, M. Sokolov, Vadym Barda, Delphine Vendryes, Christy Tanner



Abstract

Document layout analysis (DLA) is the task of detecting the distinct semantic content within a document and correctly classifying these items into an appropriate category (e.g., text, title, figure). DLA pipelines enable users to convert documents into structured, machine-readable formats that can then be used for many downstream tasks. Most existing state-of-the-art (SOTA) DLA models represent documents as images, discarding the rich metadata available in electronically generated PDFs. Directly leveraging this metadata, we represent each PDF page as a structured graph and frame the DLA problem as a graph segmentation and classification problem. We introduce the Graph-based Layout Analysis Model (GLAM), a lightweight graph neural network that is competitive with SOTA models on two challenging DLA datasets while being an order of magnitude smaller than existing models. In particular, the 4-million-parameter GLAM model outperforms the leading 140M+ parameter computer vision-based model on 5 of the 11 classes on the DocLayNet dataset. A simple ensemble of these two models achieves a new state of the art on DocLayNet, increasing mAP from 76.8 to 80.8. Overall, GLAM is over 5 times more efficient than SOTA models, making GLAM a favorable engineering choice for DLA tasks.
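
To make the graph framing concrete, the following is a minimal sketch, not GLAM itself. It assumes each text box extracted from PDF metadata becomes a node, links boxes with a hand-written geometric rule, and treats connected components as layout segments. The abstract does not specify how GLAM builds or scores edges, so the `Box` type, the `max_gap` threshold, and all function names here are hypothetical; in the paper a learned graph neural network takes the place of the heuristic, and the classification step is omitted.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    """Hypothetical text box from PDF metadata; y grows downward."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str = ""


def build_edges(boxes, max_gap=5.0):
    """Link boxes that overlap horizontally and are vertically close.

    This rule is an assumption standing in for a learned edge classifier.
    """
    edges = []
    for i, j in combinations(range(len(boxes)), 2):
        a, b = boxes[i], boxes[j]
        h_overlap = min(a.x1, b.x1) - max(a.x0, b.x0)
        v_gap = max(a.y0, b.y0) - min(a.y1, b.y1)
        if h_overlap > 0 and v_gap <= max_gap:
            edges.append((i, j))
    return edges


def segment(n_nodes, edges):
    """Group nodes into connected components with union-find."""
    parent = list(range(n_nodes))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in edges:
        parent[find(i)] = find(j)

    groups = {}
    for i in range(n_nodes):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def segment_bbox(boxes, members):
    """Bounding box enclosing every node in one segment."""
    return (
        min(boxes[m].x0 for m in members),
        min(boxes[m].y0 for m in members),
        max(boxes[m].x1 for m in members),
        max(boxes[m].y1 for m in members),
    )


# Toy page: one isolated title line and a three-line paragraph.
boxes = [
    Box(50, 40, 300, 60, "A Title"),
    Box(50, 100, 300, 112, "First line of a paragraph"),
    Box(50, 114, 300, 126, "second line of the paragraph"),
    Box(50, 128, 220, 140, "last line."),
]
edges = build_edges(boxes)
segments = segment(len(boxes), edges)
```

On this toy page the title stays in its own segment and the three paragraph lines merge into one, whose enclosing box `segment_bbox(boxes, segments[1])` could then be assigned a class such as text or title.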

