Cross-modal multi-relational graph reasoning: A novel model for multimodal textbook comprehension
Lingyun Song, Wenqing Du, Xiaolin Han, Xinbiao Gan, Xiaoqi Wang, Xuequn Shang
Information Fusion, Volume 120, Article 103082. Published 2025-03-11. DOI: 10.1016/j.inffus.2025.103082
Citations: 0
Abstract
The ability to comprehensively understand multimodal textbook content is crucial for developing advanced intelligent tutoring systems and educational tools powered by generative AI. Earlier studies have advanced the understanding of multimodal educational content by examining static cross-modal graphs that illustrate the relationships between visual objects and textual words. Such static graphs, however, fail to account for the changes in relational structure that characterize visual-textual relationships across different cross-modal tasks. To tackle this issue, we present the Cross-Modal Multi-Relational Graph Reasoning (CMRGR) model. It analyzes a wide range of interactions between the visual and textual components found in textbooks and dynamically adapts its internal representations by exploiting contextual signals across different tasks. This capability is an indispensable asset for developing generative AI systems aimed at educational applications. We evaluate CMRGR on three multimodal textbook datasets, demonstrating its superiority over state-of-the-art baselines in generating accurate classifications and answers.
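The abstract does not spell out CMRGR's architecture. For readers unfamiliar with the general technique its name points to, below is a minimal sketch of relation-specific message passing over a cross-modal graph, in the style of R-GCN. Every detail here (class name, tensor shapes, relation types, degree normalization) is an illustrative assumption, not the paper's actual method.

```python
# Illustrative sketch of multi-relational message passing over a cross-modal
# graph (R-GCN-style). NOT the paper's CMRGR architecture; all names and
# shapes are assumptions chosen for the example.

import torch
import torch.nn as nn

class MultiRelationalGraphLayer(nn.Module):
    def __init__(self, dim: int, num_relations: int):
        super().__init__()
        # One linear transform per relation type (e.g., word-word,
        # object-object, word-object), plus a self-loop transform.
        self.rel_weights = nn.ModuleList(
            nn.Linear(dim, dim, bias=False) for _ in range(num_relations)
        )
        self.self_loop = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x:   (num_nodes, dim) node features, visual and textual nodes stacked
        # adj: (num_relations, num_nodes, num_nodes), one adjacency per relation
        out = self.self_loop(x)
        for r, w in enumerate(self.rel_weights):
            # Normalize each node's aggregated message by its neighbor count
            # under relation r to keep magnitudes stable.
            deg = adj[r].sum(dim=1, keepdim=True).clamp(min=1.0)
            out = out + (adj[r] @ w(x)) / deg
        return torch.relu(out)

# Usage: 6 nodes (say, 3 visual objects + 3 words), 3 relation types.
layer = MultiRelationalGraphLayer(dim=16, num_relations=3)
x = torch.randn(6, 16)
adj = (torch.rand(3, 6, 6) > 0.7).float()
h = layer(x, adj)  # (6, 16) updated cross-modal node representations
```

A static-graph model would fix `adj` once per document; the dynamic adaptation the abstract describes would, presumably, recompute or reweight these relation-specific edges from task context rather than keep them constant.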
Journal Introduction:
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines that drive its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers presenting fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.