{"title":"SKFormer:通过自我认知增强的多模态变压器进行图表字幕","authors":"Xin Hu , Jiaxin Wang , Tao Gao","doi":"10.1016/j.sigpro.2025.110208","DOIUrl":null,"url":null,"abstract":"<div><div>Diagram captioning aims to generate sentences with the assistance of key visual objects and relationships. This task is the key basis of several applications like cross-modal retrieval and textbook question answering. Most diagrams consist of simple color blocks and geometric shapes, and the knowledge conveyed is professional and diverse. This not only results in high annotation costs, but also exacerbates the gap between sparse visuals and complex semantics. In this paper, we propose a self-knowledge enhanced multi-modal Transformer denoted as SKFormer, which is based on an encoder–decoder architecture. The encoder includes a perception aggregation graph network PAG, a self-knowledge mining module SKM, and a multi-modal semantic interaction module MMSI. The PAG network takes diagram patches as nodes, and the relationships between nodes as edges, integrating visual perception laws to enhance the visual representation. The SKM utilizes multi-modal LLMs and OCR tools to mine implicit and explicit knowledge in diagrams, while the MMSI is used for the interaction between the visual and semantic contents. The enhanced diagram representation is processed by the decoder to generate sentences, assisting learners in mastering its knowledge. 
By conducting comprehensive experiments on two datasets, we demonstrate that the SKFormer achieves superior performance over the competitors.</div></div>","PeriodicalId":49523,"journal":{"name":"Signal Processing","volume":"238 ","pages":"Article 110208"},"PeriodicalIF":3.6000,"publicationDate":"2025-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"SKFormer: Diagram captioning via self-knowledge enhanced multi-modal transformer\",\"authors\":\"Xin Hu , Jiaxin Wang , Tao Gao\",\"doi\":\"10.1016/j.sigpro.2025.110208\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Diagram captioning aims to generate sentences with the assistance of key visual objects and relationships. This task is the key basis of several applications like cross-modal retrieval and textbook question answering. Most diagrams consist of simple color blocks and geometric shapes, and the knowledge conveyed is professional and diverse. This not only results in high annotation costs, but also exacerbates the gap between sparse visuals and complex semantics. In this paper, we propose a self-knowledge enhanced multi-modal Transformer denoted as SKFormer, which is based on an encoder–decoder architecture. The encoder includes a perception aggregation graph network PAG, a self-knowledge mining module SKM, and a multi-modal semantic interaction module MMSI. The PAG network takes diagram patches as nodes, and the relationships between nodes as edges, integrating visual perception laws to enhance the visual representation. The SKM utilizes multi-modal LLMs and OCR tools to mine implicit and explicit knowledge in diagrams, while the MMSI is used for the interaction between the visual and semantic contents. The enhanced diagram representation is processed by the decoder to generate sentences, assisting learners in mastering its knowledge. 
By conducting comprehensive experiments on two datasets, we demonstrate that the SKFormer achieves superior performance over the competitors.</div></div>\",\"PeriodicalId\":49523,\"journal\":{\"name\":\"Signal Processing\",\"volume\":\"238 \",\"pages\":\"Article 110208\"},\"PeriodicalIF\":3.6000,\"publicationDate\":\"2025-07-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Signal Processing\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0165168425003226\",\"RegionNum\":2,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Signal Processing","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0165168425003226","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
SKFormer: Diagram captioning via self-knowledge enhanced multi-modal transformer
Diagram captioning aims to generate descriptive sentences with the assistance of key visual objects and their relationships. This task underpins several applications, such as cross-modal retrieval and textbook question answering. Most diagrams consist of simple color blocks and geometric shapes, yet the knowledge they convey is specialized and diverse. This not only results in high annotation costs but also exacerbates the gap between sparse visuals and complex semantics. In this paper, we propose a self-knowledge enhanced multi-modal Transformer, denoted SKFormer, built on an encoder–decoder architecture. The encoder comprises a perception aggregation graph network (PAG), a self-knowledge mining module (SKM), and a multi-modal semantic interaction module (MMSI). The PAG network takes diagram patches as nodes and the relationships between them as edges, integrating laws of visual perception to enhance the visual representation. The SKM uses multi-modal LLMs and OCR tools to mine implicit and explicit knowledge from diagrams, while the MMSI handles the interaction between visual and semantic content. The decoder processes the enhanced diagram representation to generate sentences, helping learners master the knowledge the diagram conveys. Comprehensive experiments on two datasets demonstrate that SKFormer outperforms competing methods.
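The core idea behind the PAG network — treating diagram patches as graph nodes and their spatial relationships as edges, then aggregating neighbouring information — can be illustrated with a minimal sketch. This is not the authors' implementation: the grid-adjacency edge rule and the mean-aggregation step are simplifying assumptions standing in for the paper's learned perception aggregation.

```python
# Illustrative sketch (assumptions, not the paper's PAG): patches laid out on
# an rows x cols grid become nodes; spatially adjacent patches are linked by
# edges; one mean-aggregation message-passing step mixes neighbour features.

def patch_graph(rows, cols):
    """Return adjacency lists for a rows x cols grid of diagram patches."""
    idx = lambda r, c: r * cols + c
    adj = {idx(r, c): [] for r in range(rows) for c in range(cols)}
    for r in range(rows):
        for c in range(cols):
            # Connect each patch to its neighbour below and to the right
            # (back-edges are added symmetrically).
            for dr, dc in ((1, 0), (0, 1)):
                nr, nc = r + dr, c + dc
                if nr < rows and nc < cols:
                    adj[idx(r, c)].append(idx(nr, nc))
                    adj[idx(nr, nc)].append(idx(r, c))
    return adj

def aggregate(features, adj):
    """One message-passing step: average each node with its neighbours."""
    out = []
    for i, f in enumerate(features):
        group = [f] + [features[j] for j in adj[i]]
        out.append([sum(vals) / len(group) for vals in zip(*group)])
    return out

# 2x2 grid of patches, each carrying a 1-D feature for readability.
adj = patch_graph(2, 2)
feats = [[1.0], [3.0], [5.0], [7.0]]
print(aggregate(feats, adj))  # each patch now blends its neighbours' features
```

In the paper, this aggregation would additionally be guided by learned weights reflecting visual perception laws; the sketch only shows the graph-over-patches structure the abstract describes.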
Journal introduction:
Signal Processing incorporates all aspects of the theory and practice of signal processing. It features original research work, tutorial and review articles, and accounts of practical developments. It is intended for a rapid dissemination of knowledge and experience to engineers and scientists working in the research, development or practical application of signal processing.
Subject areas covered by the journal include: Signal Theory; Stochastic Processes; Detection and Estimation; Spectral Analysis; Filtering; Signal Processing Systems; Software Developments; Image Processing; Pattern Recognition; Optical Signal Processing; Digital Signal Processing; Multi-dimensional Signal Processing; Communication Signal Processing; Biomedical Signal Processing; Geophysical and Astrophysical Signal Processing; Earth Resources Signal Processing; Acoustic and Vibration Signal Processing; Data Processing; Remote Sensing; Signal Processing Technology; Radar Signal Processing; Sonar Signal Processing; Industrial Applications; New Applications.