GAVA: Spatial awareness in image captioning with geometric-aware visual attention

IF 2.9 | Region 3 (Engineering & Technology) | Q2, ENGINEERING, ELECTRICAL & ELECTRONIC

Digital Signal Processing Pub Date : 2025-07-05 DOI:10.1016/j.dsp.2025.105435

Mohammad Alamgir Hossain , ZhongFu Ye , Md. Bipul Hossen , Md. Atiqur Rahman , Md Shohidul Islam , Md. Ibrahim Abdullah

Digital Signal Processing, Volume 167, Article 105435. https://www.sciencedirect.com/science/article/pii/S1051200425004579

Citations: 0

Abstract

Image captioning models often struggle to capture spatial relationships, which are critical for generating accurate and contextually meaningful descriptions. In this work, we propose Geometric-Aware Visual Attention (GAVA), a novel attention mechanism that integrates spatial geometry (object positions, sizes, and aspect ratios) directly into the attention process. GAVA improves spatial reasoning by using bilinear pooling to combine visual and geometric features, leading to captions that are both descriptive and spatially coherent. Additionally, we present a unified feature extraction approach that exclusively extracts geometric information, forming a representation that captures complex spatial dependencies and results in more coherent and contextually accurate captions. We demonstrate the effectiveness of GAVA through experiments on the MS-COCO dataset, where it outperforms state-of-the-art models, achieving significant improvements in BLEU, CIDEr, and SPICE scores. These results underscore GAVA's ability to capture spatial accuracy and contextual relevance, establishing a new benchmark for spatially aware image captioning. The code for GAVA is publicly available at https://github.com/alamgirustc/GAVA.
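As a rough illustration of the idea described in the abstract, the following NumPy sketch derives geometric descriptors (position, size, aspect ratio) from bounding boxes and fuses them with visual features through a low-rank (Hadamard-product) form of bilinear pooling before computing attention weights. It is not the authors' implementation: the function names `box_geometry`, `geometric_bilinear_attention`, and `init_params`, the six-dimensional geometry vector, the `tanh` projections, and all dimensions are assumptions chosen for illustration. The released repository is the reference for the actual method.

```python
import numpy as np


def box_geometry(boxes, img_w, img_h):
    """Turn pixel boxes [x1, y1, x2, y2] into normalized geometry features.

    Returns an (N, 6) array: centre x, centre y, width, height, area, aspect ratio.
    The choice of these six descriptors is an assumption for illustration.
    """
    boxes = np.asarray(boxes, dtype=np.float64)
    x1, y1, x2, y2 = boxes.T
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    cx = (x1 + x2) / (2.0 * img_w)
    cy = (y1 + y2) / (2.0 * img_h)
    aspect = (x2 - x1) / np.maximum(y2 - y1, 1e-6)  # guard against zero height
    return np.stack([cx, cy, w, h, w * h, aspect], axis=1)


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()


def init_params(d_q, d_v, d_g, d, seed=0):
    """Random projection matrices (hypothetical; a real model would learn these)."""
    rng = np.random.default_rng(seed)
    return (
        rng.normal(0.0, 0.1, (d_q, d)),  # query projection
        rng.normal(0.0, 0.1, (d_v, d)),  # visual projection
        rng.normal(0.0, 0.1, (d_g, d)),  # geometry projection
        rng.normal(0.0, 0.1, (d,)),      # attention scoring vector
    )


def geometric_bilinear_attention(query, visual, geom, params):
    """Attend over N regions using a query, visual features, and box geometry.

    query: (d_q,), visual: (N, d_v), geom: (N, d_g).
    Returns the attended visual vector (d_v,) and the attention weights (N,).
    """
    w_q, w_v, w_g, w_att = params
    q = np.tanh(query @ w_q)                          # (d,)
    # Low-rank bilinear pooling: elementwise product of projected features.
    fused = np.tanh(visual @ w_v) * np.tanh(geom @ w_g)  # (N, d)
    joint = fused * q                                 # query-conditioned interaction
    weights = softmax(joint @ w_att)                  # (N,)
    attended = weights @ visual                       # (d_v,)
    return attended, weights
```

In this sketch the geometry vector modulates each region's visual embedding before scoring, so two regions with similar appearance but different positions or sizes can receive different attention weights.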

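Source journal

Digital Signal Processing (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 5.30
Self-citation rate: 17.20%
Articles published: 435
Review time: 66 days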

Journal description: Digital Signal Processing: A Review Journal is one of the oldest and most established journals in the field of signal processing, yet it aims to be the most innovative. The Journal invites top quality research articles at the frontiers of research in all aspects of signal processing. Our objective is to provide a platform for the publication of ground-breaking research in signal processing with both academic and industrial appeal. The journal has a special emphasis on statistical signal processing methodology such as Bayesian signal processing, and encourages articles on emerging applications of signal processing such as:
• big data
• machine learning
• internet of things
• information security
• systems biology and computational biology
• financial time series analysis
• autonomous vehicles
• quantum computing
• neuromorphic engineering
• human-computer interaction and intelligent user interfaces
• environmental signal processing
• geophysical signal processing including seismic signal processing
• chemoinformatics and bioinformatics
• audio, visual and performance arts
• disaster management and prevention
• renewable energy