{"title":"用于高级视觉字幕的属性细化注意力融合网络","authors":"Md. Bipul Hossen , Zhongfu Ye , Md. Shamim Hossain , Md. Imran Hossain","doi":"10.1016/j.dsp.2025.105155","DOIUrl":null,"url":null,"abstract":"<div><div>Visual captioning, at the nexus of computer vision and natural language processing, is one of the pivotal aspects of multimedia content understanding, demands precise and contextually fitting image descriptions. Attribute-based approaches with attention mechanisms are effective in this realm. However, many of these approaches struggle to capture semantic details due to the prediction of irrelevant attributes and reduced performance. In response to these challenges, we propose an innovative solution: the Attribute Refinement Attention Fusion Network (ARAFNet), which aims to produce significant captions by distinctly identifying major objects and background information. The model features a comprehensive Attribute Refinement Attention (ARA) module, equipped with an attribute attention mechanism, which interactively extracts the most important attributes according to the linguistic context. Diverse attributes are employed at different time steps, enhancing the model's capability to utilize semantic features effectively while also filtering out irrelevant attribute words, thereby enhancing the precision of semantic guidance. An integrated fusion mechanism is then introduced to narrow the semantic gap between visual and attribute features. Finally, this fusion mechanism combined with the language LSTM to generate precise and contextually relevant captions. Extensive experimentation demonstrates our model's superiority over advanced counterparts, achieving an average CIDEr-D score of 11.88% on the Flickr30K dataset and 11.25% on the MS-COCO dataset through cross-entropy optimization. The ARAFNet model consistently outperforms the baseline model across a diverse range of evaluation metrics and makes a significant contribution to the field of image captioning precision. The implementing code and associated materials will be published at <span><span>https://github.com/mdbipu/ARAFNet</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":51011,"journal":{"name":"Digital Signal Processing","volume":"162 ","pages":"Article 105155"},"PeriodicalIF":2.9000,"publicationDate":"2025-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"ARAFNet: An attribute refinement attention fusion network for advanced visual captioning\",\"authors\":\"Md. Bipul Hossen , Zhongfu Ye , Md. Shamim Hossain , Md. Imran Hossain\",\"doi\":\"10.1016/j.dsp.2025.105155\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Visual captioning, at the nexus of computer vision and natural language processing, is one of the pivotal aspects of multimedia content understanding, demands precise and contextually fitting image descriptions. Attribute-based approaches with attention mechanisms are effective in this realm. However, many of these approaches struggle to capture semantic details due to the prediction of irrelevant attributes and reduced performance. In response to these challenges, we propose an innovative solution: the Attribute Refinement Attention Fusion Network (ARAFNet), which aims to produce significant captions by distinctly identifying major objects and background information. 
The model features a comprehensive Attribute Refinement Attention (ARA) module, equipped with an attribute attention mechanism, which interactively extracts the most important attributes according to the linguistic context. Diverse attributes are employed at different time steps, enhancing the model's capability to utilize semantic features effectively while also filtering out irrelevant attribute words, thereby enhancing the precision of semantic guidance. An integrated fusion mechanism is then introduced to narrow the semantic gap between visual and attribute features. Finally, this fusion mechanism combined with the language LSTM to generate precise and contextually relevant captions. Extensive experimentation demonstrates our model's superiority over advanced counterparts, achieving an average CIDEr-D score of 11.88% on the Flickr30K dataset and 11.25% on the MS-COCO dataset through cross-entropy optimization. The ARAFNet model consistently outperforms the baseline model across a diverse range of evaluation metrics and makes a significant contribution to the field of image captioning precision. The implementing code and associated materials will be published at <span><span>https://github.com/mdbipu/ARAFNet</span><svg><path></path></svg></span>.</div></div>\",\"PeriodicalId\":51011,\"journal\":{\"name\":\"Digital Signal Processing\",\"volume\":\"162 \",\"pages\":\"Article 105155\"},\"PeriodicalIF\":2.9000,\"publicationDate\":\"2025-03-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Digital Signal Processing\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1051200425001770\",\"RegionNum\":3,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Digital Signal Processing","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1051200425001770","RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
ARAFNet: An attribute refinement attention fusion network for advanced visual captioning
Visual captioning, at the nexus of computer vision and natural language processing, is a pivotal aspect of multimedia content understanding and demands precise, contextually fitting image descriptions. Attribute-based approaches with attention mechanisms are effective in this realm; however, many of them struggle to capture semantic details because they predict irrelevant attributes, which degrades performance. In response to these challenges, we propose an innovative solution: the Attribute Refinement Attention Fusion Network (ARAFNet), which aims to produce meaningful captions by distinctly identifying major objects and background information. The model features a comprehensive Attribute Refinement Attention (ARA) module, equipped with an attribute attention mechanism that interactively extracts the most important attributes according to the linguistic context. Diverse attributes are employed at different time steps, enhancing the model's ability to exploit semantic features while filtering out irrelevant attribute words, thereby improving the precision of semantic guidance. An integrated fusion mechanism is then introduced to narrow the semantic gap between visual and attribute features. Finally, this fusion mechanism is combined with the language LSTM to generate precise, contextually relevant captions. Extensive experimentation demonstrates our model's superiority over advanced counterparts, achieving average CIDEr-D gains of 11.88% on the Flickr30K dataset and 11.25% on the MS-COCO dataset under cross-entropy optimization. The ARAFNet model consistently outperforms the baseline model across a diverse range of evaluation metrics and makes a significant contribution to the precision of image captioning. The implementation code and associated materials will be published at https://github.com/mdbipu/ARAFNet.
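To make the described decoding pipeline concrete, the sketch below shows one way the abstract's components could fit together in PyTorch: an attribute-attention module that re-weights predicted attribute embeddings against the language LSTM's hidden state at each time step, and a gated fusion that blends the attended attribute context with the visual context before the language LSTM advances. This is a minimal illustration based only on the abstract; all class and variable names (AttributeAttention, FusionGate, lang_lstm, and so on) and all dimensions are our assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeAttention(nn.Module):
    """Additive attention over attribute embeddings, conditioned on the
    language-LSTM hidden state (a hypothetical stand-in for the ARA module)."""
    def __init__(self, attr_dim, hidden_dim, attn_dim):
        super().__init__()
        self.attr_proj = nn.Linear(attr_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, attrs, hidden):
        # attrs: (B, K, attr_dim) predicted attribute embeddings
        # hidden: (B, hidden_dim) previous language-LSTM hidden state
        e = torch.tanh(self.attr_proj(attrs) + self.hidden_proj(hidden).unsqueeze(1))
        alpha = F.softmax(self.score(e).squeeze(-1), dim=-1)   # (B, K) attention weights
        return (alpha.unsqueeze(-1) * attrs).sum(dim=1)        # (B, attr_dim) attended context

class FusionGate(nn.Module):
    """Gated blend of visual and attribute contexts, narrowing the gap
    between the two modalities before caption generation."""
    def __init__(self, vis_dim, attr_dim, out_dim):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, out_dim)
        self.attr_proj = nn.Linear(attr_dim, out_dim)
        self.gate = nn.Linear(vis_dim + attr_dim, out_dim)

    def forward(self, vis_ctx, attr_ctx):
        g = torch.sigmoid(self.gate(torch.cat([vis_ctx, attr_ctx], dim=-1)))
        return g * self.vis_proj(vis_ctx) + (1.0 - g) * self.attr_proj(attr_ctx)

# One decoding step with illustrative shapes (B = batch, K = attributes).
B, K, attr_dim, vis_dim, hid, emb, vocab = 2, 10, 300, 512, 512, 300, 9000
attrs   = torch.randn(B, K, attr_dim)   # attribute embeddings for the image
vis_ctx = torch.randn(B, vis_dim)       # attended visual feature
word    = torch.randn(B, emb)           # embedding of the previously generated word
h, c    = torch.zeros(B, hid), torch.zeros(B, hid)

ara       = AttributeAttention(attr_dim, hid, 256)
fusion    = FusionGate(vis_dim, attr_dim, hid)
lang_lstm = nn.LSTMCell(hid + emb, hid)
to_vocab  = nn.Linear(hid, vocab)

attr_ctx = ara(attrs, h)              # context-aware attribute vector for this step
fused    = fusion(vis_ctx, attr_ctx)  # fused multimodal context
h, c     = lang_lstm(torch.cat([fused, word], dim=-1), (h, c))
logits   = to_vocab(h)                # next-word scores (pre-softmax)
```

In the actual model, attrs would come from an attribute predictor and vis_ctx from a visual attention module. The point of the sketch is that the attribute attention is recomputed from the hidden state at every step, so different attributes can dominate at different time steps, matching the abstract's description of step-wise attribute refinement.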
Journal introduction:
Digital Signal Processing: A Review Journal is one of the oldest and most established journals in the field of signal processing, yet it aims to be the most innovative. The journal invites top-quality research articles at the frontiers of research in all aspects of signal processing. Our objective is to provide a platform for the publication of ground-breaking research in signal processing with both academic and industrial appeal.
The journal has a special emphasis on statistical signal processing methodology such as Bayesian signal processing, and encourages articles on emerging applications of signal processing such as:
• big data
• machine learning
• internet of things
• information security
• systems biology and computational biology
• financial time series analysis
• autonomous vehicles
• quantum computing
• neuromorphic engineering
• human-computer interaction and intelligent user interfaces
• environmental signal processing
• geophysical signal processing, including seismic signal processing
• chemoinformatics and bioinformatics
• audio, visual and performance arts
• disaster management and prevention
• renewable energy