{"title":"VSLayout:用于文档布局分析的视觉语义表示学习","authors":"Shan Wang, Jing Jiang, Yanjun Jiang, Xuesong Zhang","doi":"10.1145/3579654.3579767","DOIUrl":null,"url":null,"abstract":"Document layout analysis (DLA), aiming to extract and classify the structural regions, is a rather challenging and critical step for many downstream document understanding tasks. Although the fusion of text (semantics) and image (vision) features has shown significant advantages for DLA, existing methods either require simultaneous text-image pair inputs, which is not applicable when only document images are available, or have to resort to an optical character recognition (OCR) preprocessing. This paper learns the visual-sematic representation for DLA only from the imaging modality of documents, which greatly extends the applicability of DLA to practical applications. Our method consists of three phases. Firstly, we train a text feature extractor (TFE) for document images via cross-modal supervision that enforces the coherence between the outputs of TFE and the text embedding map generated by Sent2Vec. Then the pretrained TFE gets further adapted using only the document images and extracts shallow semantic features that will be further fed into the third stage. Finally, a two-stream network is employed to extract the deep semantic and visual features, and their fusion is used as the input to a detector module, e.g., the RPN (Region Proposal Network), to generate the final results. On benchmark datasets, we demonstrate that the proposed TFE model outperforms main-stream semantic embedding counterparts and that our approach achieves superior DLA performance to baseline methods.","PeriodicalId":146783,"journal":{"name":"Proceedings of the 2022 5th International Conference on Algorithms, Computing and Artificial Intelligence","volume":"336 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"VSLayout: Visual-Semantic Representation Learning For Document Layout Analysis\",\"authors\":\"Shan Wang, Jing Jiang, Yanjun Jiang, Xuesong Zhang\",\"doi\":\"10.1145/3579654.3579767\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Document layout analysis (DLA), aiming to extract and classify the structural regions, is a rather challenging and critical step for many downstream document understanding tasks. Although the fusion of text (semantics) and image (vision) features has shown significant advantages for DLA, existing methods either require simultaneous text-image pair inputs, which is not applicable when only document images are available, or have to resort to an optical character recognition (OCR) preprocessing. This paper learns the visual-sematic representation for DLA only from the imaging modality of documents, which greatly extends the applicability of DLA to practical applications. Our method consists of three phases. Firstly, we train a text feature extractor (TFE) for document images via cross-modal supervision that enforces the coherence between the outputs of TFE and the text embedding map generated by Sent2Vec. Then the pretrained TFE gets further adapted using only the document images and extracts shallow semantic features that will be further fed into the third stage. 
Finally, a two-stream network is employed to extract the deep semantic and visual features, and their fusion is used as the input to a detector module, e.g., the RPN (Region Proposal Network), to generate the final results. On benchmark datasets, we demonstrate that the proposed TFE model outperforms main-stream semantic embedding counterparts and that our approach achieves superior DLA performance to baseline methods.\",\"PeriodicalId\":146783,\"journal\":{\"name\":\"Proceedings of the 2022 5th International Conference on Algorithms, Computing and Artificial Intelligence\",\"volume\":\"336 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-12-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2022 5th International Conference on Algorithms, Computing and Artificial Intelligence\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3579654.3579767\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2022 5th International Conference on Algorithms, Computing and Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3579654.3579767","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
VSLayout: Visual-Semantic Representation Learning For Document Layout Analysis
Document layout analysis (DLA), which aims to extract and classify the structural regions of a document, is a challenging and critical step for many downstream document understanding tasks. Although fusing text (semantic) and image (visual) features has shown significant advantages for DLA, existing methods either require paired text-image inputs, which is not applicable when only document images are available, or have to resort to optical character recognition (OCR) preprocessing. This paper learns a visual-semantic representation for DLA from the image modality of documents alone, which greatly extends the applicability of DLA in practical settings. Our method consists of three phases. First, we train a text feature extractor (TFE) for document images via cross-modal supervision that enforces coherence between the outputs of the TFE and the text embedding map generated by Sent2Vec. The pretrained TFE is then further adapted using only document images and extracts shallow semantic features that are fed into the third stage. Finally, a two-stream network extracts the deep semantic and visual features, and their fusion is used as the input to a detector module, e.g., a Region Proposal Network (RPN), to generate the final results. On benchmark datasets, we demonstrate that the proposed TFE model outperforms mainstream semantic embedding counterparts and that our approach achieves superior DLA performance compared with baseline methods.
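To make the three-phase pipeline described above more concrete, the following is a minimal PyTorch sketch of two of its ingredients: the phase-1 cross-modal supervision, where a convolutional TFE is pulled toward a Sent2Vec-derived embedding map by a coherence loss, and the phase-3 two-stream fusion that produces features for a detector head such as an RPN. All class names, layer sizes, and the cosine-based coherence loss are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch, assuming PyTorch; layer sizes and loss choice are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextFeatureExtractor(nn.Module):
    """Maps a document image to a dense map of pseudo text embeddings (phase 1/2)."""

    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, embed_dim, 3, stride=2, padding=1),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.backbone(image)  # (B, embed_dim, H/8, W/8)


def coherence_loss(tfe_map: torch.Tensor, sent2vec_map: torch.Tensor) -> torch.Tensor:
    """Cross-modal supervision: push TFE outputs toward the Sent2Vec embedding map.
    Here modeled as 1 - mean cosine similarity over spatial locations (an assumption)."""
    cos = F.cosine_similarity(tfe_map, sent2vec_map, dim=1)
    return (1.0 - cos).mean()


class TwoStreamFusion(nn.Module):
    """Phase 3: fuse a semantic stream (from TFE features) and a visual stream
    by channel-wise concatenation; the fused map would feed a detector (e.g., an RPN)."""

    def __init__(self, embed_dim: int = 128, out_dim: int = 256):
        super().__init__()
        self.semantic_stream = nn.Conv2d(embed_dim, out_dim, 3, padding=1)
        self.visual_stream = nn.Conv2d(3, out_dim, 3, stride=8, padding=1)
        self.fuse = nn.Conv2d(2 * out_dim, out_dim, 1)

    def forward(self, shallow_semantic: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        sem = self.semantic_stream(shallow_semantic)
        vis = self.visual_stream(image)
        vis = F.interpolate(vis, size=sem.shape[-2:], mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([sem, vis], dim=1))  # fused features for the detector


if __name__ == "__main__":
    tfe = TextFeatureExtractor()
    fusion = TwoStreamFusion()
    img = torch.randn(1, 3, 256, 256)
    sem_map = tfe(img)
    target = torch.randn_like(sem_map)      # stand-in for a Sent2Vec embedding map
    loss = coherence_loss(sem_map, target)  # phase-1 cross-modal supervision signal
    fused = fusion(sem_map, img)            # phase-3 fusion fed to a detector head
    print(loss.item(), fused.shape)
```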