融合网格和自适应区域特征的图像字幕

IF 4.2 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Image and Vision Computing Pub Date : 2025-03-24 DOI:10.1016/j.imavis.2025.105513

Jiahui Wei , Zhixin Li , Canlong Zhang , Huifang Ma

{"title":"融合网格和自适应区域特征的图像字幕","authors":"Jiahui Wei , Zhixin Li , Canlong Zhang , Huifang Ma","doi":"10.1016/j.imavis.2025.105513","DOIUrl":null,"url":null,"abstract":"<div><div>Image captioning aims to automatically generate grammatically correct and reasonable description sentences for given images. Improving feature optimization and processing is crucial for enhancing performance in this task. A common approach is to leverage the complementary advantages of grid features and region features. However, incorporating region features in most current methods may lead to incorrect guidance during training, along with high acquisition costs and the requirement of pre-caching. These factors impact the effectiveness and practical application of image captioning. To address these limitations, this paper proposes a method called fusing grid and adaptive region features for image captioning (FGAR). FGAR dynamically explores pseudo-region information within a given image based on the extracted grid features. Subsequently, it utilizes a combination of computational layers with varying permissions to fuse features, enabling comprehensive interaction between information from different modalities while preserving the unique characteristics of each modality. The resulting enhanced visual features provide improved support to the decoder for autoregressively generating sentences describing the content of a given image. All processes are integrated within a fully end-to-end framework, facilitating both training and inference processes while achieving satisfactory performance. Extensive experiments validate the effectiveness of the proposed FGAR method.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"157 ","pages":"Article 105513"},"PeriodicalIF":4.2000,"publicationDate":"2025-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Fusing grid and adaptive region features for image captioning\",\"authors\":\"Jiahui Wei , Zhixin Li , Canlong Zhang , Huifang Ma\",\"doi\":\"10.1016/j.imavis.2025.105513\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Image captioning aims to automatically generate grammatically correct and reasonable description sentences for given images. Improving feature optimization and processing is crucial for enhancing performance in this task. A common approach is to leverage the complementary advantages of grid features and region features. However, incorporating region features in most current methods may lead to incorrect guidance during training, along with high acquisition costs and the requirement of pre-caching. These factors impact the effectiveness and practical application of image captioning. To address these limitations, this paper proposes a method called fusing grid and adaptive region features for image captioning (FGAR). FGAR dynamically explores pseudo-region information within a given image based on the extracted grid features. Subsequently, it utilizes a combination of computational layers with varying permissions to fuse features, enabling comprehensive interaction between information from different modalities while preserving the unique characteristics of each modality. The resulting enhanced visual features provide improved support to the decoder for autoregressively generating sentences describing the content of a given image. All processes are integrated within a fully end-to-end framework, facilitating both training and inference processes while achieving satisfactory performance. Extensive experiments validate the effectiveness of the proposed FGAR method.</div></div>\",\"PeriodicalId\":50374,\"journal\":{\"name\":\"Image and Vision Computing\",\"volume\":\"157 \",\"pages\":\"Article 105513\"},\"PeriodicalIF\":4.2000,\"publicationDate\":\"2025-03-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Image and Vision Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0262885625001015\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885625001015","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

摘要

图像字幕的目的是为给定的图像自动生成语法正确、合理的描述句子。改进特征优化和处理对于提高该任务的性能至关重要。一种常见的方法是利用网格特征和区域特征的互补优势。然而，在大多数现有方法中加入区域特征可能会导致训练过程中的错误指导，以及高获取成本和预缓存要求。这些因素影响了图像字幕的有效性和实际应用。为了解决这些局限性，本文提出了一种融合网格和自适应区域特征的图像字幕方法（FGAR）。FGAR基于提取的网格特征动态地探索给定图像中的伪区域信息。随后，它利用具有不同权限的计算层组合来融合特征，使来自不同模态的信息之间进行全面交互，同时保留每个模态的独特特征。由此产生的增强的视觉特征为解码器提供了更好的支持，用于自回归生成描述给定图像内容的句子。所有流程都集成在一个完整的端到端框架中，在实现令人满意的性能的同时，促进了训练和推理过程。大量的实验验证了该方法的有效性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Fusing grid and adaptive region features for image captioning

Image captioning aims to automatically generate grammatically correct and reasonable description sentences for given images. Improving feature optimization and processing is crucial for enhancing performance in this task. A common approach is to leverage the complementary advantages of grid features and region features. However, incorporating region features in most current methods may lead to incorrect guidance during training, along with high acquisition costs and the requirement of pre-caching. These factors impact the effectiveness and practical application of image captioning. To address these limitations, this paper proposes a method called fusing grid and adaptive region features for image captioning (FGAR). FGAR dynamically explores pseudo-region information within a given image based on the extracted grid features. Subsequently, it utilizes a combination of computational layers with varying permissions to fuse features, enabling comprehensive interaction between information from different modalities while preserving the unique characteristics of each modality. The resulting enhanced visual features provide improved support to the decoder for autoregressively generating sentences describing the content of a given image. All processes are integrated within a fully end-to-end framework, facilitating both training and inference processes while achieving satisfactory performance. Extensive experiments validate the effectiveness of the proposed FGAR method.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Image and Vision Computing 工程技术-工程：电子与电气

CiteScore

8.50

自引率

8.50%

发文量

143

审稿时长

7.8 months

期刊介绍： Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.