Fei Li, Xinyu Li, Sijie Wen, Haoyang Huang, Jinsong Bao
Computers in Industry, volume 171, Article 104336. Published 2025-07-09. DOI: 10.1016/j.compind.2025.104336. Impact factor: 9.1; JCR Q1 (Computer Science, Interdisciplinary Applications).
https://www.sciencedirect.com/science/article/pii/S0166361525001010
SAMAC-R3-MED: Semantic alignment and multi-agent collaboration of retriever-reranker-responder models for multimodal engineering documents
Across the manufacturing lifecycle, a vast number of engineering documents are generated in text, table, and image formats. Retrieval-augmented generation (RAG) models can enhance retrieval efficiency and adapt to evolving document knowledge. However, difficulty in understanding multimodal semantic associations and the absence of RAG models aligned with engineering semantics result in suboptimal accuracy. This paper introduces a novel approach, SAMAC-R3-MED, to tackle these challenges. First, a fine-grained context enhancement strategy is applied to multimodal large language models (MLLMs), bridging multimodal semantic understanding by constructing multi-modal semantic trees (MMST) and multi-modal knowledge graphs (MMKG), which together form a hybrid retrieval base. Second, to bridge the semantic gap in RAG models, a new training framework, retriever-reranker-responder (R3), is proposed; it uses supervised and reinforcement learning with ranking feedback to enhance alignment. Third, a multi-channel hybrid retrieval strategy is implemented for the multi-agent collaboration of the R3 models, integrating expert feedback, semantic trees, and graphs to optimize the RAG pipeline and improve the accuracy of retrieving multimodal associative semantic contexts. An engineering documents chat (eDoChat) system is implemented for a wind turbine assembly case, validating the approach's effectiveness in retrieving and generating accurate multimodal answers. Ablation experiments show that R3 models outperform traditional RAG models, and that SAMAC-R3-MED achieves state-of-the-art results in multimodal retrieval and generation tasks.
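To make the retriever-reranker-responder flow more concrete, the following is a minimal toy sketch of a two-channel retrieve, rerank, respond pipeline. It is not the authors' implementation: the corpus `CHUNKS`, the entity map `GRAPH`, the word-overlap scoring, and every function name are hypothetical stand-ins for the trained models, semantic trees, and knowledge graphs the abstract describes, and the responder only assembles context rather than calling an MLLM.

```python
from collections import Counter

# Hypothetical toy corpus: chunk id -> (modality, text). Not from the paper.
CHUNKS = {
    "t1": ("text", "tighten the blade bolts to the specified torque"),
    "tb1": ("table", "bolt torque table for blade and hub assembly"),
    "img1": ("image", "figure caption: hub assembly with blade bolts"),
    "t2": ("text", "paint the tower section after inspection"),
}

# Hypothetical graph channel: entity -> ids of chunks linked to that entity.
GRAPH = {"blade": ["t1", "img1"], "hub": ["tb1", "img1"], "tower": ["t2"]}


def overlap(query, text):
    """Count word occurrences shared by the query and the text."""
    q, t = Counter(query.lower().split()), Counter(text.lower().split())
    return sum((q & t).values())


def retrieve_text_channel(query, k=3):
    """Channel 1: lexical overlap (assumed stand-in for a trained retriever)."""
    scored = [(overlap(query, text), cid) for cid, (_, text) in CHUNKS.items()]
    return [cid for score, cid in sorted(scored, reverse=True)[:k] if score > 0]


def retrieve_graph_channel(query):
    """Channel 2: entity lookup (assumed stand-in for a knowledge-graph query)."""
    hits = []
    for word in query.lower().split():
        hits.extend(GRAPH.get(word, []))
    return hits


def rerank(query, candidates, k=2):
    """Merge channels, deduplicate, and re-score (stand-in for a trained reranker)."""
    votes = Counter(candidates)  # chunks returned by both channels get two votes
    ordered = sorted(
        votes,
        key=lambda cid: (-votes[cid], -overlap(query, CHUNKS[cid][1]), cid),
    )
    return ordered[:k]


def respond(query, ranked):
    """Assemble the context a responder would see; a real system would call an MLLM."""
    context = "; ".join(f"[{CHUNKS[c][0]}] {CHUNKS[c][1]}" for c in ranked)
    return f"Q: {query}\nContext: {context}"


def answer(query):
    """Run both retrieval channels, rerank the merged candidates, then respond."""
    candidates = retrieve_text_channel(query) + retrieve_graph_channel(query)
    return respond(query, rerank(query, candidates))
```

Calling `answer("blade bolts torque")` merges the candidates from both channels and passes the top two chunks to the responder. Counting how many channels returned a chunk is only one simple way to fuse channels; the paper's ranking feedback and expert feedback are not modeled here.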
About the journal:
The objective of Computers in Industry is to present original, high-quality, application-oriented research papers that:
• Illuminate emerging trends and possibilities in the utilization of Information and Communication Technology in industry;
• Establish connections or integrations across various technology domains within the expansive realm of computer applications for industry;
• Foster connections or integrations across diverse application areas of ICT in industry.