Title: Fusing grid and adaptive region features for image captioning
Authors: Jiahui Wei, Zhixin Li, Canlong Zhang, Huifang Ma
Journal: Image and Vision Computing, vol. 157, Article 105513 (impact factor 4.2; JCR Q2, Computer Science, Artificial Intelligence)
Published: 2025-03-24
DOI: 10.1016/j.imavis.2025.105513
URL: https://www.sciencedirect.com/science/article/pii/S0262885625001015
Citations: 0
Abstract
Image captioning aims to automatically generate grammatically correct and reasonable description sentences for given images. Improving feature optimization and processing is crucial for enhancing performance in this task. A common approach is to leverage the complementary advantages of grid features and region features. However, incorporating region features in most current methods may lead to incorrect guidance during training, along with high acquisition costs and the requirement of pre-caching. These factors impact the effectiveness and practical application of image captioning. To address these limitations, this paper proposes a method called fusing grid and adaptive region features for image captioning (FGAR). FGAR dynamically explores pseudo-region information within a given image based on the extracted grid features. Subsequently, it utilizes a combination of computational layers with varying permissions to fuse features, enabling comprehensive interaction between information from different modalities while preserving the unique characteristics of each modality. The resulting enhanced visual features provide improved support to the decoder for autoregressively generating sentences describing the content of a given image. All processes are integrated within a fully end-to-end framework, facilitating both training and inference processes while achieving satisfactory performance. Extensive experiments validate the effectiveness of the proposed FGAR method.
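The abstract does not say how pseudo-regions are derived from the grid features or what "varying permissions" means in practice, so the following is only an illustrative sketch under stated assumptions, not the authors' FGAR implementation. It assumes that pseudo-regions are obtained by softly pooling grid cells around a few hypothetical query vectors, and that "permissions" are attention masks: a first layer in which each feature type attends only to itself, followed by a layer in which every token may attend to every other. All function names, shapes, and the random inputs are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pseudo_regions(grid, queries):
    """Pool grid features (N, d) into K pseudo-region features (K, d).

    Assumption: each hypothetical query softly selects the grid cells
    it is most similar to; this stands in for the paper's unspecified step.
    """
    d = grid.shape[1]
    assign = softmax(queries @ grid.T / np.sqrt(d), axis=-1)  # (K, N)
    return assign @ grid                                      # (K, d)

def masked_attention(tokens, allow):
    """Single-head self-attention without learned projections.

    allow[i, j] is True when token i is permitted to attend to token j.
    """
    d = tokens.shape[1]
    scores = tokens @ tokens.T / np.sqrt(d)
    scores = np.where(allow, scores, -1e9)  # blocked pairs get ~zero weight
    return softmax(scores, axis=-1) @ tokens

def fuse(grid, regions):
    """Fuse grid and pseudo-region tokens with two permission levels."""
    n, k = len(grid), len(regions)
    tokens = np.concatenate([grid, regions], axis=0)

    # Restricted permission: grid attends to grid, regions to regions.
    intra = np.zeros((n + k, n + k), dtype=bool)
    intra[:n, :n] = True
    intra[n:, n:] = True

    # Full permission: every token may attend to every other token.
    full = np.ones((n + k, n + k), dtype=bool)

    h = tokens + masked_attention(tokens, intra)  # residual connection
    h = h + masked_attention(h, full)
    return h

rng = np.random.default_rng(0)
grid = rng.normal(size=(49, 64))     # e.g. a 7x7 grid of 64-d features
queries = rng.normal(size=(5, 64))   # 5 hypothetical region queries
regions = pseudo_regions(grid, queries)
fused = fuse(grid, regions)          # (54, 64) tokens for a caption decoder
```

In this sketch the restricted layer lets each feature type refine itself before mixing, which is one plausible way to read "preserving the unique characteristics" of each feature type; the fused tokens would then be passed to an autoregressive decoder.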
About the journal:
Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.