利用记忆差异编码和注意力进行基于组别的独特图像字幕制作

IF 11.6 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision Pub Date : 2024-10-08 DOI:10.1007/s11263-024-02220-6

Jiuniu Wang, Wenjia Xu, Qingzhong Wang, Antoni B. Chan

{"title":"利用记忆差异编码和注意力进行基于组别的独特图像字幕制作","authors":"Jiuniu Wang, Wenjia Xu, Qingzhong Wang, Antoni B. Chan","doi":"10.1007/s11263-024-02220-6","DOIUrl":null,"url":null,"abstract":"<p>Recent advances in image captioning have focused on enhancing accuracy by substantially increasing the dataset and model size. While conventional captioning models exhibit high performance on established metrics such as BLEU, CIDEr, and SPICE, the capability of captions to distinguish the target image from other similar images is under-explored. To generate distinctive captions, a few pioneers employed contrastive learning or re-weighted the ground-truth captions. However, these approaches often overlook the relationships among objects in a similar image group (e.g., items or properties within the same album or fine-grained events). In this paper, we introduce a novel approach to enhance the distinctiveness of image captions, namely Group-based Differential Distinctive Captioning Method, which visually compares each image with other images in one similar group and highlights the uniqueness of each image. In particular, we introduce a Group-based Differential Memory Attention (GDMA) module, designed to identify and emphasize object features in an image that are uniquely distinguishable within its image group, i.e., those exhibiting low similarity with objects in other images. This mechanism ensures that such unique object features are prioritized during caption generation for the image, thereby enhancing the distinctiveness of the resulting captions. To further refine this process, we select distinctive words from the ground-truth captions to guide both the language decoder and the GDMA module. Additionally, we propose a new evaluation metric, the Distinctive Word Rate (DisWordRate), to quantitatively assess caption distinctiveness. Quantitative results indicate that the proposed method significantly improves the distinctiveness of several baseline models, and achieves state-of-the-art performance on distinctiveness while not excessively sacrificing accuracy. Moreover, the results of our user study are consistent with the quantitative evaluation and demonstrate the rationality of the new metric DisWordRate.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"6 1","pages":""},"PeriodicalIF":11.6000,"publicationDate":"2024-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Group-Based Distinctive Image Captioning with Memory Difference Encoding and Attention\",\"authors\":\"Jiuniu Wang, Wenjia Xu, Qingzhong Wang, Antoni B. Chan\",\"doi\":\"10.1007/s11263-024-02220-6\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Recent advances in image captioning have focused on enhancing accuracy by substantially increasing the dataset and model size. While conventional captioning models exhibit high performance on established metrics such as BLEU, CIDEr, and SPICE, the capability of captions to distinguish the target image from other similar images is under-explored. To generate distinctive captions, a few pioneers employed contrastive learning or re-weighted the ground-truth captions. However, these approaches often overlook the relationships among objects in a similar image group (e.g., items or properties within the same album or fine-grained events). In this paper, we introduce a novel approach to enhance the distinctiveness of image captions, namely Group-based Differential Distinctive Captioning Method, which visually compares each image with other images in one similar group and highlights the uniqueness of each image. In particular, we introduce a Group-based Differential Memory Attention (GDMA) module, designed to identify and emphasize object features in an image that are uniquely distinguishable within its image group, i.e., those exhibiting low similarity with objects in other images. This mechanism ensures that such unique object features are prioritized during caption generation for the image, thereby enhancing the distinctiveness of the resulting captions. To further refine this process, we select distinctive words from the ground-truth captions to guide both the language decoder and the GDMA module. Additionally, we propose a new evaluation metric, the Distinctive Word Rate (DisWordRate), to quantitatively assess caption distinctiveness. Quantitative results indicate that the proposed method significantly improves the distinctiveness of several baseline models, and achieves state-of-the-art performance on distinctiveness while not excessively sacrificing accuracy. Moreover, the results of our user study are consistent with the quantitative evaluation and demonstrate the rationality of the new metric DisWordRate.</p>\",\"PeriodicalId\":13752,\"journal\":{\"name\":\"International Journal of Computer Vision\",\"volume\":\"6 1\",\"pages\":\"\"},\"PeriodicalIF\":11.6000,\"publicationDate\":\"2024-10-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Journal of Computer Vision\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1007/s11263-024-02220-6\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Computer Vision","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s11263-024-02220-6","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

摘要

图像标题制作的最新进展主要集中在通过大幅增加数据集和模型规模来提高准确性。虽然传统的字幕模型在 BLEU、CIDEr 和 SPICE 等既定指标上表现出很高的性能，但字幕将目标图像与其他类似图像区分开来的能力还未得到充分探索。为了生成与众不同的标题，一些先驱者采用了对比学习或重新加权地面实况标题的方法。然而，这些方法往往忽略了相似图像组中对象之间的关系（例如，同一相册中的项目或属性或细粒度事件）。在本文中，我们介绍了一种增强图像标题独特性的新方法，即基于组的差异化标题方法，该方法将每张图像与一个相似组中的其他图像进行直观比较，并突出每张图像的独特性。特别是，我们引入了一个基于组的差异记忆注意力（GDMA）模块，旨在识别和强调图像中在其图像组内可独特区分的物体特征，即那些与其他图像中的物体相似度较低的特征。这一机制可确保在为图像生成标题时优先考虑这些独特的对象特征，从而增强生成的标题的独特性。为了进一步完善这一过程，我们从地面实况字幕中选择了一些独特的词语来指导语言解码器和 GDMA 模块。此外，我们还提出了一个新的评估指标--独特词率（DisWordRate），用于定量评估字幕的独特性。定量结果表明，所提出的方法显著提高了几个基线模型的显著性，并在不过分牺牲准确性的情况下实现了最先进的显著性性能。此外，我们的用户研究结果与定量评估结果一致，证明了新指标 DisWordRate 的合理性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

Group-Based Distinctive Image Captioning with Memory Difference Encoding and Attention

查看原文本刊更多论文

Group-Based Distinctive Image Captioning with Memory Difference Encoding and Attention

Recent advances in image captioning have focused on enhancing accuracy by substantially increasing the dataset and model size. While conventional captioning models exhibit high performance on established metrics such as BLEU, CIDEr, and SPICE, the capability of captions to distinguish the target image from other similar images is under-explored. To generate distinctive captions, a few pioneers employed contrastive learning or re-weighted the ground-truth captions. However, these approaches often overlook the relationships among objects in a similar image group (e.g., items or properties within the same album or fine-grained events). In this paper, we introduce a novel approach to enhance the distinctiveness of image captions, namely Group-based Differential Distinctive Captioning Method, which visually compares each image with other images in one similar group and highlights the uniqueness of each image. In particular, we introduce a Group-based Differential Memory Attention (GDMA) module, designed to identify and emphasize object features in an image that are uniquely distinguishable within its image group, i.e., those exhibiting low similarity with objects in other images. This mechanism ensures that such unique object features are prioritized during caption generation for the image, thereby enhancing the distinctiveness of the resulting captions. To further refine this process, we select distinctive words from the ground-truth captions to guide both the language decoder and the GDMA module. Additionally, we propose a new evaluation metric, the Distinctive Word Rate (DisWordRate), to quantitatively assess caption distinctiveness. Quantitative results indicate that the proposed method significantly improves the distinctiveness of several baseline models, and achieves state-of-the-art performance on distinctiveness while not excessively sacrificing accuracy. Moreover, the results of our user study are consistent with the quantitative evaluation and demonstrate the rationality of the new metric DisWordRate.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

International Journal of Computer Vision 工程技术-计算机：人工智能

CiteScore

29.80

自引率

2.10%

发文量

163

审稿时长

6 months

期刊介绍： The International Journal of Computer Vision (IJCV) serves as a platform for sharing new research findings in the rapidly growing field of computer vision. It publishes 12 issues annually and presents high-quality, original contributions to the science and engineering of computer vision. The journal encompasses various types of articles to cater to different research outputs. Regular articles, which span up to 25 journal pages, focus on significant technical advancements that are of broad interest to the field. These articles showcase substantial progress in computer vision. Short articles, limited to 10 pages, offer a swift publication path for novel research outcomes. They provide a quicker means for sharing new findings with the computer vision community. Survey articles, comprising up to 30 pages, offer critical evaluations of the current state of the art in computer vision or offer tutorial presentations of relevant topics. These articles provide comprehensive and insightful overviews of specific subject areas. In addition to technical articles, the journal also includes book reviews, position papers, and editorials by prominent scientific figures. These contributions serve to complement the technical content and provide valuable perspectives. The journal encourages authors to include supplementary material online, such as images, video sequences, data sets, and software. This additional material enhances the understanding and reproducibility of the published research. Overall, the International Journal of Computer Vision is a comprehensive publication that caters to researchers in this rapidly growing field. It covers a range of article types, offers additional online resources, and facilitates the dissemination of impactful research.