Dual dynamic transformer for image captioning

IF 7.5; Zone 1 (Computer Science); JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Expert Systems with Applications. Pub Date: 2025-06-16. DOI: 10.1016/j.eswa.2025.128597

Chun Shan, Chuanle Song, Tongyi Zou, Jiayi Li, Shaoming Liu

Volume 292, Article 128597. Publication type: Journal Article. Open access: no. URL: https://www.sciencedirect.com/science/article/pii/S095741742502216X

Citations: 0

Abstract

The task of image captioning, widely acclaimed in the field of computer vision, aims to depict the content of an image, wielding a significant impact on people’s lives. Present methodologies for this task typically involve extracting global and local features to capture both overall and intricate details within images. However, the former, reliant on high-level, low-resolution grid features, when directly inputted into transformer encoders, may falter in establishing robust correlations between individual grids, thereby leading to the loss of relationship information between grid features. Additionally, the latter, utilizing region features derived from object detectors, may hinder transformers from comprehending the semantic relationships among regions, resulting in semantic information loss. To tackle these challenges, we introduce a novel Dual Dynamic Transformer (D$^{2}$T) framework for image captioning, amalgamating the benefits of dynamic grid features and dynamic region features. Specifically, the Dynamic Pseudo-regions Grid (DPG) encoder enhances the strong correlation between grid features by grouping the attention of different grids and dynamically generating pseudo-regions, facilitating superior fusion with region features. Furthermore, the Dynamic Multi-Level Relation Region (DMR$^{2}$) encoder augments the comprehension of semantic relationships among various region features through attention-based multi-level relations. In the encoding phase, to seamlessly integrate dynamic grid features and dynamic region features, we propose a feature fusion module for combining these two distinct feature types. Moreover, additional experiments conducted on the MSCOCO dataset demonstrate that our model achieves state-of-the-art performance without incurring additional parameter overhead.
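The abstract does not give the equations behind the DPG encoder or the fusion module, so the following is only an illustrative sketch of the general idea of pooling grid features into pseudo-regions with attention and then combining them with region features. Everything here is an assumption made for illustration: the function names `pseudo_regions` and `fuse`, the use of fixed query vectors, scaled dot-product attention, and fusion by simple concatenation are hypothetical and are not taken from the paper.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pseudo_regions(grid_feats, queries):
    """Pool grid features into len(queries) pseudo-regions (hypothetical).

    Each query attends over all grid cells with scaled dot-product
    attention; the pseudo-region is the attention-weighted sum of the
    grid features. This is an assumed mechanism, not the paper's.
    """
    dim = len(grid_feats[0])
    scale = math.sqrt(dim)
    regions = []
    for q in queries:
        weights = softmax([dot(q, g) / scale for g in grid_feats])
        regions.append(
            [sum(w * g[d] for w, g in zip(weights, grid_feats)) for d in range(dim)]
        )
    return regions

def fuse(pseudo, region_feats):
    # Naive stand-in for a fusion module: concatenate the token sequences.
    return pseudo + region_feats

# Toy data: 4 grid cells and 1 detector region, feature dimension 2.
grid = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
queries = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical queries
regions = [[0.5, 0.5]]
tokens = fuse(pseudo_regions(grid, queries), regions)
```

In this toy setup, `tokens` holds two pseudo-region vectors followed by the one detector region vector, which is the kind of mixed sequence a downstream transformer encoder could consume. A real model would use learned projections and multi-head attention rather than fixed queries.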




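Source journal

Expert Systems with Applications (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 13.80
Self-citation rate: 10.60%
Articles published: 2045
Review time: 8.7 months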

Journal description: Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.