SKFormer: Diagram captioning via self-knowledge enhanced multi-modal transformer

IF 3.6 | Zone 2 (Engineering & Technology) | JCR Q2, ENGINEERING, ELECTRICAL & ELECTRONIC

Signal Processing | Pub Date: 2025-07-28 | DOI: 10.1016/j.sigpro.2025.110208

Xin Hu, Jiaxin Wang, Tao Gao

Signal Processing, vol. 238, Article 110208. Full text: https://www.sciencedirect.com/science/article/pii/S0165168425003226

Citations: 0

Abstract

Diagram captioning aims to generate sentences guided by key visual objects and relationships. The task underpins applications such as cross-modal retrieval and textbook question answering. Most diagrams consist of simple color blocks and geometric shapes, yet the knowledge they convey is specialized and diverse. This not only raises annotation costs but also widens the gap between sparse visuals and complex semantics. In this paper, we propose a self-knowledge enhanced multi-modal Transformer, denoted SKFormer, built on an encoder–decoder architecture. The encoder comprises a perception aggregation graph network (PAG), a self-knowledge mining module (SKM), and a multi-modal semantic interaction module (MMSI). The PAG network treats diagram patches as nodes and the relationships between nodes as edges, integrating visual perception laws to enhance the visual representation. The SKM uses multi-modal LLMs and OCR tools to mine implicit and explicit knowledge in diagrams, while the MMSI handles the interaction between visual and semantic content. The decoder processes the enhanced diagram representation to generate sentences that help learners master the knowledge in the diagram. Comprehensive experiments on two datasets show that SKFormer outperforms competing methods.
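
To make the "patches as nodes, relationships as edges" idea concrete, here is a minimal sketch. It is not the authors' PAG network: the abstract does not say how edges are chosen or how features are aggregated, so this example assumes edges link spatially adjacent patches (a stand-in for a proximity-style perception law) and uses plain mean aggregation. The names `build_patch_graph` and `aggregate` are hypothetical.

```python
# Hypothetical sketch: a patch graph plus one round of neighbour aggregation.
# Only "patches as nodes, relationships as edges" comes from the abstract;
# the adjacency rule and the mean aggregation are assumptions for illustration.


def build_patch_graph(rows, cols):
    """Return adjacency lists for a rows x cols grid of patches.

    Nodes are patch indices in row-major order. An edge links two patches
    that touch horizontally or vertically (assumed proximity relationship).
    """
    adjacency = {r * cols + c: [] for r in range(rows) for c in range(cols)}
    for r in range(rows):
        for c in range(cols):
            node = r * cols + c
            if c + 1 < cols:  # link to the right-hand neighbour
                adjacency[node].append(node + 1)
                adjacency[node + 1].append(node)
            if r + 1 < rows:  # link to the neighbour below
                adjacency[node].append(node + cols)
                adjacency[node + cols].append(node)
    return adjacency


def aggregate(features, adjacency):
    """Average each node's feature vector with those of its neighbours."""
    updated = []
    for node in range(len(features)):
        group = [features[node]] + [features[n] for n in adjacency[node]]
        dim = len(features[node])
        updated.append(
            [sum(vec[i] for vec in group) / len(group) for i in range(dim)]
        )
    return updated


if __name__ == "__main__":
    graph = build_patch_graph(2, 2)  # four patches in a 2 x 2 grid
    feats = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]  # toy features
    print(aggregate(feats, graph))
```

In a real model the toy feature vectors would be replaced by learned patch embeddings, and the fixed mean by a learned, weighted aggregation. Both choices are left open here because the abstract does not specify them.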

Source journal

Signal Processing (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore

9.20

Self-citation rate

9.10%

Articles published

309

Review time

41 days

About the journal: Signal Processing incorporates all aspects of the theory and practice of signal processing. It features original research work, tutorial and review articles, and accounts of practical developments. It is intended for the rapid dissemination of knowledge and experience to engineers and scientists working in the research, development or practical application of signal processing. Subject areas covered by the journal include: Signal Theory; Stochastic Processes; Detection and Estimation; Spectral Analysis; Filtering; Signal Processing Systems; Software Developments; Image Processing; Pattern Recognition; Optical Signal Processing; Digital Signal Processing; Multi-dimensional Signal Processing; Communication Signal Processing; Biomedical Signal Processing; Geophysical and Astrophysical Signal Processing; Earth Resources Signal Processing; Acoustic and Vibration Signal Processing; Data Processing; Remote Sensing; Signal Processing Technology; Radar Signal Processing; Sonar Signal Processing; Industrial Applications; New Applications.