{"title":"VSLayout:用于文档布局分析的视觉语义表示学习","authors":"Shan Wang, Jing Jiang, Yanjun Jiang, Xuesong Zhang","doi":"10.1145/3579654.3579767","DOIUrl":null,"url":null,"abstract":"Document layout analysis (DLA), aiming to extract and classify the structural regions, is a rather challenging and critical step for many downstream document understanding tasks. Although the fusion of text (semantics) and image (vision) features has shown significant advantages for DLA, existing methods either require simultaneous text-image pair inputs, which is not applicable when only document images are available, or have to resort to an optical character recognition (OCR) preprocessing. This paper learns the visual-sematic representation for DLA only from the imaging modality of documents, which greatly extends the applicability of DLA to practical applications. Our method consists of three phases. Firstly, we train a text feature extractor (TFE) for document images via cross-modal supervision that enforces the coherence between the outputs of TFE and the text embedding map generated by Sent2Vec. Then the pretrained TFE gets further adapted using only the document images and extracts shallow semantic features that will be further fed into the third stage. Finally, a two-stream network is employed to extract the deep semantic and visual features, and their fusion is used as the input to a detector module, e.g., the RPN (Region Proposal Network), to generate the final results. On benchmark datasets, we demonstrate that the proposed TFE model outperforms main-stream semantic embedding counterparts and that our approach achieves superior DLA performance to baseline methods.","PeriodicalId":146783,"journal":{"name":"Proceedings of the 2022 5th International Conference on Algorithms, Computing and Artificial Intelligence","volume":"336 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"VSLayout: Visual-Semantic Representation Learning For Document Layout Analysis\",\"authors\":\"Shan Wang, Jing Jiang, Yanjun Jiang, Xuesong Zhang\",\"doi\":\"10.1145/3579654.3579767\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Document layout analysis (DLA), aiming to extract and classify the structural regions, is a rather challenging and critical step for many downstream document understanding tasks. Although the fusion of text (semantics) and image (vision) features has shown significant advantages for DLA, existing methods either require simultaneous text-image pair inputs, which is not applicable when only document images are available, or have to resort to an optical character recognition (OCR) preprocessing. This paper learns the visual-sematic representation for DLA only from the imaging modality of documents, which greatly extends the applicability of DLA to practical applications. Our method consists of three phases. Firstly, we train a text feature extractor (TFE) for document images via cross-modal supervision that enforces the coherence between the outputs of TFE and the text embedding map generated by Sent2Vec. Then the pretrained TFE gets further adapted using only the document images and extracts shallow semantic features that will be further fed into the third stage. 
Finally, a two-stream network is employed to extract the deep semantic and visual features, and their fusion is used as the input to a detector module, e.g., the RPN (Region Proposal Network), to generate the final results. On benchmark datasets, we demonstrate that the proposed TFE model outperforms main-stream semantic embedding counterparts and that our approach achieves superior DLA performance to baseline methods.\",\"PeriodicalId\":146783,\"journal\":{\"name\":\"Proceedings of the 2022 5th International Conference on Algorithms, Computing and Artificial Intelligence\",\"volume\":\"336 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-12-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2022 5th International Conference on Algorithms, Computing and Artificial Intelligence\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3579654.3579767\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2022 5th International Conference on Algorithms, Computing and Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3579654.3579767","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
VSLayout: Visual-Semantic Representation Learning For Document Layout Analysis
Document layout analysis (DLA), which aims to extract and classify the structural regions of a document, is a challenging and critical step for many downstream document understanding tasks. Although fusing text (semantic) and image (visual) features has shown significant advantages for DLA, existing methods either require paired text-image inputs, which is not applicable when only document images are available, or have to resort to optical character recognition (OCR) preprocessing. This paper learns a visual-semantic representation for DLA from the image modality of documents alone, which greatly extends the applicability of DLA in practical settings. Our method consists of three phases. First, we train a text feature extractor (TFE) for document images via cross-modal supervision that enforces coherence between the outputs of the TFE and the text embedding map generated by Sent2Vec. The pretrained TFE is then further adapted using only document images and extracts shallow semantic features that are fed into the third stage. Finally, a two-stream network extracts the deep semantic and visual features, and their fusion is used as the input to a detector module, e.g., a Region Proposal Network (RPN), to generate the final results. On benchmark datasets, we demonstrate that the proposed TFE model outperforms mainstream semantic embedding counterparts and that our approach achieves superior DLA performance compared with baseline methods.
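To make the three-phase pipeline described above more concrete, the following is a minimal PyTorch sketch of two of its ingredients: the phase-1 cross-modal supervision, where a convolutional TFE is pulled toward a Sent2Vec-derived embedding map by a coherence loss, and the phase-3 two-stream fusion that produces features for a detector head such as an RPN. All class names, layer sizes, and the cosine-based coherence loss are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch, assuming PyTorch; layer sizes and loss choice are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextFeatureExtractor(nn.Module):
    """Maps a document image to a dense map of pseudo text embeddings (phase 1/2)."""

    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, embed_dim, 3, stride=2, padding=1),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.backbone(image)  # (B, embed_dim, H/8, W/8)


def coherence_loss(tfe_map: torch.Tensor, sent2vec_map: torch.Tensor) -> torch.Tensor:
    """Cross-modal supervision: push TFE outputs toward the Sent2Vec embedding map.
    Here modeled as 1 - mean cosine similarity over spatial locations (an assumption)."""
    cos = F.cosine_similarity(tfe_map, sent2vec_map, dim=1)
    return (1.0 - cos).mean()


class TwoStreamFusion(nn.Module):
    """Phase 3: fuse a semantic stream (from TFE features) and a visual stream
    by channel-wise concatenation; the fused map would feed a detector (e.g., an RPN)."""

    def __init__(self, embed_dim: int = 128, out_dim: int = 256):
        super().__init__()
        self.semantic_stream = nn.Conv2d(embed_dim, out_dim, 3, padding=1)
        self.visual_stream = nn.Conv2d(3, out_dim, 3, stride=8, padding=1)
        self.fuse = nn.Conv2d(2 * out_dim, out_dim, 1)

    def forward(self, shallow_semantic: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        sem = self.semantic_stream(shallow_semantic)
        vis = self.visual_stream(image)
        vis = F.interpolate(vis, size=sem.shape[-2:], mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([sem, vis], dim=1))  # fused features for the detector


if __name__ == "__main__":
    tfe = TextFeatureExtractor()
    fusion = TwoStreamFusion()
    img = torch.randn(1, 3, 256, 256)
    sem_map = tfe(img)
    target = torch.randn_like(sem_map)      # stand-in for a Sent2Vec embedding map
    loss = coherence_loss(sem_map, target)  # phase-1 cross-modal supervision signal
    fused = fusion(sem_map, img)            # phase-3 fusion fed to a detector head
    print(loss.item(), fused.shape)
```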