{"title":"docouter:用于文档理解的提示引导视觉转换器和混合专家连接器","authors":"Jinxu Zhang, Yu Zhang","doi":"10.1016/j.inffus.2025.103206","DOIUrl":null,"url":null,"abstract":"<div><div>Document Visual Question Answering (DVQA) involves responding to queries based on the contents of document images. The emergence of large visual language models (LVLMs) has produced impressive results on documents with simple layouts, but they still struggle with documents with complex layouts, mainly because these models cannot accurately locate the content related to the prompt. In this work, we propose a two-stage text prompt fusion scheme. In the first stage, an attention-gating mechanism is used to insert the text prompt into each layer of the visual encoder, and a late fusion strategy is adopted to reduce interference with the performance of the original model. This uses the text prompt to instruct the visual encoder to produce visual content that is more relevant to the text information and to achieve fine-grained alignment between visual and textual information. In the second stage, we replace the original MLP projection module with a Mixture-of-experts (MoE) module, which better aligns the visual information with the large language model and provides stronger generalization in multi-task training. The relevant data are collected for pre-training to achieve multi-modal alignment. In addition, to ensure the coarse-grained alignment of vision and text, we propose a gated cross-attention fusion method accordingly. Through these two fusion schemes, the model effectively filters out irrelevant information and enhances accuracy in question answering. The experimental results show that DocRouter can achieve robust results with only a 2B model, and other experimental analyses show the effectiveness of our method.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"122 ","pages":"Article 103206"},"PeriodicalIF":15.5000,"publicationDate":"2025-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"DocRouter: Prompt guided vision transformer and Mixture of Experts connector for document understanding\",\"authors\":\"Jinxu Zhang, Yu Zhang\",\"doi\":\"10.1016/j.inffus.2025.103206\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Document Visual Question Answering (DVQA) involves responding to queries based on the contents of document images. The emergence of large visual language models (LVLMs) has produced impressive results on documents with simple layouts, but they still struggle with documents with complex layouts, mainly because these models cannot accurately locate the content related to the prompt. In this work, we propose a two-stage text prompt fusion scheme. In the first stage, an attention-gating mechanism is used to insert the text prompt into each layer of the visual encoder, and a late fusion strategy is adopted to reduce interference with the performance of the original model. This uses the text prompt to instruct the visual encoder to produce visual content that is more relevant to the text information and to achieve fine-grained alignment between visual and textual information. In the second stage, we replace the original MLP projection module with a Mixture-of-experts (MoE) module, which better aligns the visual information with the large language model and provides stronger generalization in multi-task training. 
The relevant data are collected for pre-training to achieve multi-modal alignment. In addition, to ensure the coarse-grained alignment of vision and text, we propose a gated cross-attention fusion method accordingly. Through these two fusion schemes, the model effectively filters out irrelevant information and enhances accuracy in question answering. The experimental results show that DocRouter can achieve robust results with only a 2B model, and other experimental analyses show the effectiveness of our method.</div></div>\",\"PeriodicalId\":50367,\"journal\":{\"name\":\"Information Fusion\",\"volume\":\"122 \",\"pages\":\"Article 103206\"},\"PeriodicalIF\":15.5000,\"publicationDate\":\"2025-04-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information Fusion\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1566253525002799\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Fusion","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1566253525002799","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
DocRouter: Prompt guided vision transformer and Mixture of Experts connector for document understanding
Document Visual Question Answering (DVQA) involves answering queries based on the contents of document images. Large visual language models (LVLMs) have produced impressive results on documents with simple layouts, but they still struggle with complex layouts, mainly because these models cannot accurately locate the content related to the prompt. In this work, we propose a two-stage text prompt fusion scheme. In the first stage, an attention-gating mechanism inserts the text prompt into each layer of the visual encoder, and a late fusion strategy is adopted to reduce interference with the performance of the original model. The text prompt thereby instructs the visual encoder to produce visual content that is more relevant to the textual information, achieving fine-grained alignment between visual and textual information. In the second stage, we replace the original MLP projection module with a Mixture-of-Experts (MoE) module, which better aligns the visual information with the large language model and provides stronger generalization in multi-task training. Relevant data are collected for pre-training to achieve multi-modal alignment. In addition, to ensure coarse-grained alignment of vision and text, we propose a gated cross-attention fusion method. Through these two fusion schemes, the model effectively filters out irrelevant information and improves question-answering accuracy. Experimental results show that DocRouter achieves robust results with only a 2B model, and further experimental analyses demonstrate the effectiveness of our method.
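As a rough illustration of the first-stage mechanism, the sketch below (PyTorch; the class name, gate parameterization, and the layers at which the prompt is injected are our assumptions, not the paper's released code) shows how a text-prompt embedding could be inserted into a vision transformer block through a zero-initialized, tanh-gated cross-attention step, in the spirit of the late-fusion strategy described in the abstract.

```python
# Minimal sketch (not the authors' code): a ViT block extended with a gated
# cross-attention step that lets a text-prompt embedding steer the visual
# features. The zero-initialized tanh gate realizes a late-fusion insertion:
# at initialization the block behaves exactly like the original ViT block.
import torch
import torch.nn as nn


class PromptGatedViTBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, mlp_ratio: float = 4.0):
        super().__init__()
        # Original ViT components (self-attention + MLP).
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )
        # Added components: cross-attention from visual tokens to the text
        # prompt, modulated by a learnable gate initialized at zero.
        self.norm_prompt = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0 -> no effect at init

    def forward(self, visual: torch.Tensor, prompt: torch.Tensor) -> torch.Tensor:
        # visual: (B, N_vis, dim) patch tokens; prompt: (B, N_txt, dim) text tokens.
        v = self.norm1(visual)
        visual = visual + self.self_attn(v, v, v)[0]
        # Gated injection of the text prompt into the visual stream.
        prompt_ctx = self.cross_attn(self.norm_prompt(visual), prompt, prompt)[0]
        visual = visual + torch.tanh(self.gate) * prompt_ctx
        visual = visual + self.mlp(self.norm2(visual))
        return visual
```

Because tanh(0) = 0, the gated branch contributes nothing at the start of training, so the gate can open gradually without degrading the pretrained encoder, which is one plausible reading of the "reduce interference" goal; the choice of which encoder layers receive the prompt is likewise an assumption.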
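For the second stage, the following sketch (again PyTorch, with an illustrative expert count and top-k routing; DocRouter's actual router, expert design, and any load-balancing loss are not specified in the abstract) shows one way a Mixture-of-Experts connector can replace the usual MLP projector between the visual encoder and the language model.

```python
# Minimal sketch (assumptions, not the released implementation): a
# Mixture-of-Experts connector projecting visual tokens into the LLM's
# embedding space. A softmax router assigns each visual token to its
# top-k expert MLPs and mixes their outputs with the routing weights.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEConnector(nn.Module):
    def __init__(self, vis_dim: int, llm_dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(vis_dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(vis_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))
            for _ in range(num_experts)
        ])

    def forward(self, visual: torch.Tensor) -> torch.Tensor:
        # visual: (B, N, vis_dim) -> projected tokens: (B, N, llm_dim)
        logits = self.router(visual)                        # (B, N, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)  # top-k experts per token
        weights = F.softmax(weights, dim=-1)
        llm_dim = self.experts[0][-1].out_features
        out = torch.zeros(*visual.shape[:-1], llm_dim,
                          device=visual.device, dtype=visual.dtype)
        for k in range(self.top_k):
            idx = indices[..., k]                # (B, N) expert id per token
            w = weights[..., k].unsqueeze(-1)    # (B, N, 1) mixing weight
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    out[mask] = out[mask] + w[mask] * expert(visual[mask])
        return out
```

Routing at the level of individual visual tokens lets different experts specialize, for example on dense text regions versus layout elements, which is one plausible reading of the stronger multi-task generalization claimed for the MoE connector.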
Journal introduction:
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses as well as those demonstrating their application to real-world problems will be welcome.