Mohammad Alamgir Hossain, ZhongFu Ye, Md. Bipul Hossen, Md. Atiqur Rahman, Md Shohidul Islam, Md. Ibrahim Abdullah
Digital Signal Processing, vol. 167, Article 105435. Published 2025-07-05. DOI: 10.1016/j.dsp.2025.105435. https://www.sciencedirect.com/science/article/pii/S1051200425004579
GAVA: Spatial awareness in image captioning with geometric-aware visual attention
Image captioning models often face challenges in capturing spatial relationships, which are critical for generating accurate and contextually meaningful descriptions. In this work, we propose Geometric-Aware Visual Attention (GAVA), a novel attention mechanism that integrates spatial geometry (such as object positions, sizes, and aspect ratios) directly into the attention process. GAVA improves spatial reasoning by utilizing bilinear pooling to effectively combine visual and geometric features, leading to captions that are both descriptive and spatially coherent. Additionally, we present a unified feature extraction approach that exclusively extracts geometric information, forming a representation that captures complex spatial dependencies and results in more coherent and contextually accurate captions. We demonstrate the effectiveness of GAVA through experiments on the MS-COCO dataset, where it outperforms state-of-the-art models, achieving significant improvements in BLEU, CIDEr, and SPICE scores. These results underscore GAVA's ability to capture spatial accuracy and contextual relevance, establishing a new benchmark for spatially-aware image captioning. The code for GAVA is publicly available at https://github.com/alamgirustc/GAVA.
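The abstract gives only a high-level description of the mechanism, so the following is an illustrative sketch rather than the authors' implementation. It assumes region bounding boxes in pixel coordinates, derives position, size, and aspect-ratio features from them, and fuses them with visual features using a low-rank (Hadamard-product) approximation of bilinear pooling. The function names `box_geometry` and `geometric_bilinear_attention`, the parameter names, and all dimensions are hypothetical.

```python
import numpy as np

def box_geometry(boxes, img_w, img_h):
    """Geometry features for boxes given as [x1, y1, x2, y2] in pixels.

    Returns an (N, 6) array: normalized center x, center y, width,
    height, aspect ratio (width / height), and relative area.
    This feature set is an assumption, not taken from the paper.
    """
    boxes = np.asarray(boxes, dtype=np.float64)
    x1, y1, x2, y2 = boxes.T
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    cx = (x1 + x2) / (2.0 * img_w)
    cy = (y1 + y2) / (2.0 * img_h)
    aspect = (x2 - x1) / np.maximum(y2 - y1, 1e-6)  # guard against zero height
    return np.stack([cx, cy, w, h, aspect, w * h], axis=1)

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

def geometric_bilinear_attention(query, visual, geometry, params):
    """Attend over N regions using visual and geometric cues.

    query: (d_q,) decoder state; visual: (N, d_v); geometry: (N, d_g).
    Low-rank bilinear fusion: project each input to a shared size k,
    then multiply elementwise (a stand-in for full bilinear pooling).
    """
    q = np.tanh(query @ params["Wq"])      # (k,)
    v = np.tanh(visual @ params["Wv"])     # (N, k)
    g = np.tanh(geometry @ params["Wg"])   # (N, k)
    joint = q * v * g                      # (N, k), q broadcasts over regions
    alpha = softmax(joint @ params["w"])   # (N,) attention weights
    attended = alpha @ visual              # (d_v,) weighted visual summary
    return alpha, attended

# Usage with random, untrained parameters (shapes only, no learned behavior).
rng = np.random.default_rng(0)
boxes = [[0, 0, 50, 100], [20, 40, 90, 180], [10, 10, 30, 30]]
geometry = box_geometry(boxes, img_w=100, img_h=200)
visual = rng.normal(size=(3, 8))
query = rng.normal(size=5)
params = {
    "Wq": rng.normal(size=(5, 4)),
    "Wv": rng.normal(size=(8, 4)),
    "Wg": rng.normal(size=(6, 4)),
    "w": rng.normal(size=4),
}
alpha, attended = geometric_bilinear_attention(query, visual, geometry, params)
```

The elementwise product keeps the parameter count linear in the input sizes, whereas a full bilinear form would need a weight tensor over every pair of visual and geometric dimensions. The repository linked above is the place to check how the published model actually performs the fusion.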
About the journal:
Digital Signal Processing: A Review Journal is one of the oldest and most established journals in the field of signal processing, yet it aims to be the most innovative. The Journal invites top quality research articles at the frontiers of research in all aspects of signal processing. Our objective is to provide a platform for the publication of ground-breaking research in signal processing with both academic and industrial appeal.
The journal has a special emphasis on statistical signal processing methodology such as Bayesian signal processing, and encourages articles on emerging applications of signal processing such as:
• big data
• machine learning
• internet of things
• information security
• systems biology and computational biology
• financial time series analysis
• autonomous vehicles
• quantum computing
• neuromorphic engineering
• human-computer interaction and intelligent user interfaces
• environmental signal processing
• geophysical signal processing including seismic signal processing
• chemoinformatics and bioinformatics
• audio, visual and performance arts
• disaster management and prevention
• renewable energy