{"title":"docouter:用于文档理解的提示引导视觉转换器和混合专家连接器","authors":"Jinxu Zhang, Yu Zhang","doi":"10.1016/j.inffus.2025.103206","DOIUrl":null,"url":null,"abstract":"<div><div>Document Visual Question Answering (DVQA) involves responding to queries based on the contents of document images. The emergence of large visual language models (LVLMs) has produced impressive results on documents with simple layouts, but they still struggle with documents with complex layouts, mainly because these models cannot accurately locate the content related to the prompt. In this work, we propose a two-stage text prompt fusion scheme. In the first stage, an attention-gating mechanism is used to insert the text prompt into each layer of the visual encoder, and a late fusion strategy is adopted to reduce interference with the performance of the original model. This uses the text prompt to instruct the visual encoder to produce visual content that is more relevant to the text information and to achieve fine-grained alignment between visual and textual information. In the second stage, we replace the original MLP projection module with a Mixture-of-experts (MoE) module, which better aligns the visual information with the large language model and provides stronger generalization in multi-task training. The relevant data are collected for pre-training to achieve multi-modal alignment. In addition, to ensure the coarse-grained alignment of vision and text, we propose a gated cross-attention fusion method accordingly. Through these two fusion schemes, the model effectively filters out irrelevant information and enhances accuracy in question answering. The experimental results show that DocRouter can achieve robust results with only a 2B model, and other experimental analyses show the effectiveness of our method.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"122 ","pages":"Article 103206"},"PeriodicalIF":15.5000,"publicationDate":"2025-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"DocRouter: Prompt guided vision transformer and Mixture of Experts connector for document understanding\",\"authors\":\"Jinxu Zhang, Yu Zhang\",\"doi\":\"10.1016/j.inffus.2025.103206\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Document Visual Question Answering (DVQA) involves responding to queries based on the contents of document images. The emergence of large visual language models (LVLMs) has produced impressive results on documents with simple layouts, but they still struggle with documents with complex layouts, mainly because these models cannot accurately locate the content related to the prompt. In this work, we propose a two-stage text prompt fusion scheme. In the first stage, an attention-gating mechanism is used to insert the text prompt into each layer of the visual encoder, and a late fusion strategy is adopted to reduce interference with the performance of the original model. This uses the text prompt to instruct the visual encoder to produce visual content that is more relevant to the text information and to achieve fine-grained alignment between visual and textual information. In the second stage, we replace the original MLP projection module with a Mixture-of-experts (MoE) module, which better aligns the visual information with the large language model and provides stronger generalization in multi-task training. 
The relevant data are collected for pre-training to achieve multi-modal alignment. In addition, to ensure the coarse-grained alignment of vision and text, we propose a gated cross-attention fusion method accordingly. Through these two fusion schemes, the model effectively filters out irrelevant information and enhances accuracy in question answering. The experimental results show that DocRouter can achieve robust results with only a 2B model, and other experimental analyses show the effectiveness of our method.</div></div>\",\"PeriodicalId\":50367,\"journal\":{\"name\":\"Information Fusion\",\"volume\":\"122 \",\"pages\":\"Article 103206\"},\"PeriodicalIF\":15.5000,\"publicationDate\":\"2025-04-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information Fusion\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1566253525002799\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Fusion","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1566253525002799","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
DocRouter: Prompt guided vision transformer and Mixture of Experts connector for document understanding
Document Visual Question Answering (DVQA) involves answering queries based on the contents of document images. Large visual language models (LVLMs) have produced impressive results on documents with simple layouts, but they still struggle with complex layouts, mainly because these models cannot accurately locate the content related to the prompt. In this work, we propose a two-stage text prompt fusion scheme. In the first stage, an attention-gating mechanism inserts the text prompt into each layer of the visual encoder, and a late fusion strategy is adopted to reduce interference with the performance of the original model. The text prompt thereby instructs the visual encoder to produce visual content that is more relevant to the textual information, achieving fine-grained alignment between visual and textual information. In the second stage, we replace the original MLP projection module with a Mixture-of-Experts (MoE) module, which better aligns the visual information with the large language model and provides stronger generalization in multi-task training. Relevant data are collected for pre-training to achieve multi-modal alignment. In addition, to ensure coarse-grained alignment of vision and text, we propose a gated cross-attention fusion method. Through these two fusion schemes, the model effectively filters out irrelevant information and improves question-answering accuracy. Experimental results show that DocRouter achieves robust results with only a 2B model, and further experimental analyses demonstrate the effectiveness of our method.
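As a rough illustration of the first-stage mechanism, the sketch below (PyTorch; the class name, gate parameterization, and the layers at which the prompt is injected are our assumptions, not the paper's released code) shows how a text-prompt embedding could be inserted into a vision transformer block through a zero-initialized, tanh-gated cross-attention step, in the spirit of the late-fusion strategy described in the abstract.

```python
# Minimal sketch (not the authors' code): a ViT block extended with a gated
# cross-attention step that lets a text-prompt embedding steer the visual
# features. The zero-initialized tanh gate realizes a late-fusion insertion:
# at initialization the block behaves exactly like the original ViT block.
import torch
import torch.nn as nn


class PromptGatedViTBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, mlp_ratio: float = 4.0):
        super().__init__()
        # Original ViT components (self-attention + MLP).
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )
        # Added components: cross-attention from visual tokens to the text
        # prompt, modulated by a learnable gate initialized at zero.
        self.norm_prompt = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0 -> no effect at init

    def forward(self, visual: torch.Tensor, prompt: torch.Tensor) -> torch.Tensor:
        # visual: (B, N_vis, dim) patch tokens; prompt: (B, N_txt, dim) text tokens.
        v = self.norm1(visual)
        visual = visual + self.self_attn(v, v, v)[0]
        # Gated injection of the text prompt into the visual stream.
        prompt_ctx = self.cross_attn(self.norm_prompt(visual), prompt, prompt)[0]
        visual = visual + torch.tanh(self.gate) * prompt_ctx
        visual = visual + self.mlp(self.norm2(visual))
        return visual
```

Because tanh(0) = 0, the gated branch contributes nothing at the start of training, so the gate can open gradually without degrading the pretrained encoder, which is one plausible reading of the "reduce interference" goal; the choice of which encoder layers receive the prompt is likewise an assumption.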
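For the second stage, the following sketch (again PyTorch, with an illustrative expert count and top-k routing; DocRouter's actual router, expert design, and any load-balancing loss are not specified in the abstract) shows one way a Mixture-of-Experts connector can replace the usual MLP projector between the visual encoder and the language model.

```python
# Minimal sketch (assumptions, not the released implementation): a
# Mixture-of-Experts connector projecting visual tokens into the LLM's
# embedding space. A softmax router assigns each visual token to its
# top-k expert MLPs and mixes their outputs with the routing weights.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEConnector(nn.Module):
    def __init__(self, vis_dim: int, llm_dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(vis_dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(vis_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))
            for _ in range(num_experts)
        ])

    def forward(self, visual: torch.Tensor) -> torch.Tensor:
        # visual: (B, N, vis_dim) -> projected tokens: (B, N, llm_dim)
        logits = self.router(visual)                        # (B, N, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)  # top-k experts per token
        weights = F.softmax(weights, dim=-1)
        llm_dim = self.experts[0][-1].out_features
        out = torch.zeros(*visual.shape[:-1], llm_dim,
                          device=visual.device, dtype=visual.dtype)
        for k in range(self.top_k):
            idx = indices[..., k]                # (B, N) expert id per token
            w = weights[..., k].unsqueeze(-1)    # (B, N, 1) mixing weight
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    out[mask] = out[mask] + w[mask] * expert(visual[mask])
        return out
```

Routing at the level of individual visual tokens lets different experts specialize, for example on dense text regions versus layout elements, which is one plausible reading of the stronger multi-task generalization claimed for the MoE connector.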
Journal introduction:
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses as well as those demonstrating their application to real-world problems will be welcome.