SeaCap: Multi-Sight Embedding and Alignment for One-Stage Image Captioner

IF 9.7 · Zone 1, Computer Science · JCR Q1, COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia · Pub Date: 2025-02-20 · DOI: 10.1109/TMM.2025.3535303

Bo Wang;Zhao Zhang;Mingbo Zhao;Xiaojie Jin;Mingliang Xu;Meng Wang

IEEE Transactions on Multimedia, vol. 27, pp. 3411-3425. Journal Article. https://ieeexplore.ieee.org/document/10896876/

Citations: 0

Abstract

SeaCap: Multi-Sight Embedding and Alignment for One-Stage Image Captioner

Recent mainstream image captioning methods usually adopt two-stage captioners, i.e., calculating the object features of the given image by a pre-trained detector and then feeding them into a language model to generate the descriptive sentences. However, such a two-stage procedure will lead to a task-based information gap that decreases the performance of the captioners, because the object features learned from the detection task are suboptimal representations and cannot provide all the necessary information for subsequent sentence generation. Besides, the object features are usually represented by the last pooling features of the detector that lose the local details of images. In this paper, we propose a novel One-Stage Image Captioner using dynamic multi-sight embedding and alignment, called SeaCap, which directly transforms input images into descriptive sentences in one stage to eliminate the information gap. Specifically, to obtain rich features, we use the Swin Transformer to capture multi-level features, followed by a sights alignment module to alleviate the vision confusion, and then feed them into a novel dynamic multi-sight embedding module to exploit both the global structure and local texture of input images. To enhance the global modeling capacity of the visual encoder, we propose a new dual-dimensional refining module to non-locally model the interaction of the embedded features. As a result, SeaCap can obtain rich and useful information to improve the performance of the captioner. Extensive comparisons on the benchmark MS-COCO, Flickr8K and Flickr30K datasets verified the superior performance of our method.
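The abstract gives no equations for the alignment or embedding modules, so the following is only a toy sketch of the general idea it describes: features from several levels are brought to a common grid and then fused with input-dependent weights. Everything here is an assumption for illustration, not the SeaCap implementation: the function names (`avg_pool_to`, `softmax`, `fuse_levels`), the use of average pooling for alignment, the mean-activation gate, and the single-channel grids are all hypothetical.

```python
import math


def avg_pool_to(feature_map, out_size):
    """Average-pool a square 2D grid (list of lists) to out_size x out_size.

    Assumption: stands in for an alignment step; the paper's module may differ.
    """
    in_size = len(feature_map)
    if in_size % out_size != 0:
        raise ValueError("input size must be a multiple of out_size")
    k = in_size // out_size
    pooled = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            block = [feature_map[i * k + di][j * k + dj]
                     for di in range(k) for dj in range(k)]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_levels(levels, out_size):
    """Align every level to a common grid, then mix them with dynamic weights.

    Assumption: the weights come from each level's mean activation, a
    hypothetical stand-in for a learned, input-dependent gate.
    """
    aligned = [avg_pool_to(level, out_size) for level in levels]
    scores = [sum(sum(row) for row in grid) / (out_size * out_size)
              for grid in aligned]
    weights = softmax(scores)
    fused = [[sum(w * grid[i][j] for w, grid in zip(weights, aligned))
              for j in range(out_size)]
             for i in range(out_size)]
    return fused, weights


# Two hypothetical levels: a fine 4x4 grid and a coarse 2x2 grid.
fine = [[1.0] * 4 for _ in range(4)]
coarse = [[3.0] * 2 for _ in range(2)]
fused, weights = fuse_levels([fine, coarse], out_size=2)
```

In this toy setting the coarse level has the larger mean activation, so it receives the larger weight and the fused values lie between the two levels. A real model would operate on multi-channel tensors and learn the gate rather than compute it from a fixed statistic.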

Source journal

IEEE Transactions on Multimedia (Engineering & Technology: Telecommunications)

CiteScore: 11.70

Self-citation rate: 11.00%

Articles published: 576

Review time: 5.5 months

About the journal: The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.