Dual dynamic transformer for image captioning
Chun Shan, Chuanle Song, Tongyi Zou, Jiayi Li, Shaoming Liu
Expert Systems with Applications, volume 292, Article 128597. Published 2025-06-16.
DOI: 10.1016/j.eswa.2025.128597
URL: https://www.sciencedirect.com/science/article/pii/S095741742502216X
Journal impact factor: 7.5. JCR: Q1 (Computer Science, Artificial Intelligence).
Citations: 0
Abstract
Image captioning, a widely studied task in computer vision with significant real-world impact, aims to describe the content of an image. Current methods typically extract global and local features to capture both the overall scene and its fine details. Global features, however, rely on high-level, low-resolution grid features; when these are fed directly into a transformer encoder, the model may fail to establish robust correlations between individual grids, so relationship information between grid features is lost. Local features, in turn, rely on region features produced by object detectors, which may hinder the transformer from capturing the semantic relationships among regions, resulting in semantic information loss. To address these problems, we introduce a novel Dual Dynamic Transformer (D²T) framework for image captioning that combines the benefits of dynamic grid features and dynamic region features. Specifically, the Dynamic Pseudo-regions Grid (DPG) encoder strengthens the correlation between grid features by grouping the attention of different grids and dynamically generating pseudo-regions, which facilitates fusion with region features. The Dynamic Multi-Level Relation Region (DMR²) encoder improves the modeling of semantic relationships among region features through attention-based multi-level relations. In the encoding phase, a feature fusion module combines these two distinct feature types. Experiments on the MSCOCO dataset demonstrate that our model achieves state-of-the-art performance without incurring additional parameter overhead.
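The abstract does not specify how the feature fusion module is built. As a minimal sketch only, and not the authors' method, one generic way to let grid and region features interact is to concatenate both token sets and run a single scaled dot-product self-attention pass over the joint sequence. The function names `softmax`, `attend`, and `fuse`, the absence of learned projections, and the toy feature values are all illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention on lists of equal-length vectors."""
    dim = len(keys[0])
    out_dim = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(out_dim)]
        )
    return outputs


def fuse(grid_feats, region_feats):
    """Hypothetical fusion: self-attention over the concatenated token set.

    Assumption: this is a generic stand-in, not the paper's fusion module.
    """
    tokens = grid_feats + region_feats
    return attend(tokens, tokens, tokens)


# Toy inputs: 3 grid cells and 2 detected regions, each a 2-dimensional feature.
grid = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
regions = [[0.5, 0.5], [2.0, 0.0]]

fused = fuse(grid, regions)
print(len(fused), len(fused[0]))  # 5 2
```

Because every fused vector is a weighted average over both grid and region tokens, each output token carries information from both feature types. A real encoder would add learned query, key, and value projections, multiple heads, and residual connections.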
About the journal:
Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.