带门控周期的多粒度表示聚合变换器，用于更改字幕

IF 6 3区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on Multimedia Computing Communications and Applications Pub Date : 2024-04-22 DOI:10.1145/3660346

Shengbin Yue, Yunbin Tu, Liang Li, Shengxiang Gao, Zhengtao Yu

{"title":"带门控周期的多粒度表示聚合变换器，用于更改字幕","authors":"Shengbin Yue, Yunbin Tu, Liang Li, Shengxiang Gao, Zhengtao Yu","doi":"10.1145/3660346","DOIUrl":null,"url":null,"abstract":"<p>Change captioning aims to describe the difference within an image pair in natural language, which combines visual comprehension and language generation. Although significant progress has been achieved, it remains a key challenge of perceiving the object change from different perspectives, especially the severe situation with drastic viewpoint change. In this paper, we propose a novel full-attentive network, namely Multi-grained Representation Aggregating Transformer (MURAT), to distinguish the actual change from viewpoint change. Specifically, the Pair Encoder first captures similar semantics between pairwise objects in a multi-level manner, which are regarded as the semantic cues of distinguishing the irrelevant change. Next, a novel Multi-grained Representation Aggregator (MRA) is designed to construct the reliable difference representation by employing both coarse- and fine-grained semantic cues. Finally, the language decoder generates a description of the change based on the output of MRA. Besides, the Gating Cycle Mechanism is introduced to facilitate the semantic consistency between difference representation learning and language generation with a reverse manipulation, so as to bridge the semantic gap between change features and text features. Extensive experiments demonstrate that the proposed MURAT can greatly improve the ability to describe the actual change in the distraction of irrelevant change and achieves state-of-the-art performance on three benchmarks, CLEVR-Change, CLEVR-DC and Spot-the-Diff.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"81 1","pages":""},"PeriodicalIF":6.0000,"publicationDate":"2024-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multi-grained Representation Aggregating Transformer with Gating Cycle for Change Captioning\",\"authors\":\"Shengbin Yue, Yunbin Tu, Liang Li, Shengxiang Gao, Zhengtao Yu\",\"doi\":\"10.1145/3660346\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Change captioning aims to describe the difference within an image pair in natural language, which combines visual comprehension and language generation. Although significant progress has been achieved, it remains a key challenge of perceiving the object change from different perspectives, especially the severe situation with drastic viewpoint change. In this paper, we propose a novel full-attentive network, namely Multi-grained Representation Aggregating Transformer (MURAT), to distinguish the actual change from viewpoint change. Specifically, the Pair Encoder first captures similar semantics between pairwise objects in a multi-level manner, which are regarded as the semantic cues of distinguishing the irrelevant change. Next, a novel Multi-grained Representation Aggregator (MRA) is designed to construct the reliable difference representation by employing both coarse- and fine-grained semantic cues. Finally, the language decoder generates a description of the change based on the output of MRA. Besides, the Gating Cycle Mechanism is introduced to facilitate the semantic consistency between difference representation learning and language generation with a reverse manipulation, so as to bridge the semantic gap between change features and text features. Extensive experiments demonstrate that the proposed MURAT can greatly improve the ability to describe the actual change in the distraction of irrelevant change and achieves state-of-the-art performance on three benchmarks, CLEVR-Change, CLEVR-DC and Spot-the-Diff.</p>\",\"PeriodicalId\":50937,\"journal\":{\"name\":\"ACM Transactions on Multimedia Computing Communications and Applications\",\"volume\":\"81 1\",\"pages\":\"\"},\"PeriodicalIF\":6.0000,\"publicationDate\":\"2024-04-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Transactions on Multimedia Computing Communications and Applications\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1145/3660346\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Multimedia Computing Communications and Applications","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3660346","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

摘要

变化字幕旨在用自然语言描述图像对中的差异，它将视觉理解和语言生成结合在一起。尽管已经取得了重大进展，但从不同视角感知物体变化，尤其是视角急剧变化的严重情况，仍然是一个关键挑战。在本文中，我们提出了一种新颖的全注意力网络，即多粒度表征聚合转换器（MURAT），以区分实际变化和视角变化。具体来说，对编码器首先以多层次的方式捕捉成对对象之间的相似语义，并将其视为区分无关变化的语义线索。然后，设计出一种新颖的多粒度表征聚合器（MRA），利用粗粒度和细粒度语义线索构建可靠的差异表征。最后，语言解码器根据 MRA 的输出生成对变化的描述。此外，我们还引入了门控循环机制（Gating Cycle Mechanism），通过反向操作来促进差异表征学习与语言生成之间的语义一致性，从而弥合变化特征与文本特征之间的语义鸿沟。广泛的实验证明，所提出的 MURAT 能够在干扰无关变化的情况下大大提高描述实际变化的能力，并在 CLEVR-Change、CLEVR-DC 和 Spot-the-Diff 三个基准测试中取得了最先进的性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Multi-grained Representation Aggregating Transformer with Gating Cycle for Change Captioning

Change captioning aims to describe the difference within an image pair in natural language, which combines visual comprehension and language generation. Although significant progress has been achieved, it remains a key challenge of perceiving the object change from different perspectives, especially the severe situation with drastic viewpoint change. In this paper, we propose a novel full-attentive network, namely Multi-grained Representation Aggregating Transformer (MURAT), to distinguish the actual change from viewpoint change. Specifically, the Pair Encoder first captures similar semantics between pairwise objects in a multi-level manner, which are regarded as the semantic cues of distinguishing the irrelevant change. Next, a novel Multi-grained Representation Aggregator (MRA) is designed to construct the reliable difference representation by employing both coarse- and fine-grained semantic cues. Finally, the language decoder generates a description of the change based on the output of MRA. Besides, the Gating Cycle Mechanism is introduced to facilitate the semantic consistency between difference representation learning and language generation with a reverse manipulation, so as to bridge the semantic gap between change features and text features. Extensive experiments demonstrate that the proposed MURAT can greatly improve the ability to describe the actual change in the distraction of irrelevant change and achieves state-of-the-art performance on three benchmarks, CLEVR-Change, CLEVR-DC and Spot-the-Diff.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

ACM Transactions on Multimedia Computing Communications and Applications 工程技术-计算机：理论方法

CiteScore

8.50

自引率

5.90%

发文量

285

审稿时长

7.5 months

期刊介绍： The ACM Transactions on Multimedia Computing, Communications, and Applications is the flagship publication of the ACM Special Interest Group in Multimedia (SIGMM). It is soliciting paper submissions on all aspects of multimedia. Papers on single media (for instance, audio, video, animation) and their processing are also welcome. TOMM is a peer-reviewed, archival journal, available in both print form and digital form. The Journal is published quarterly; with roughly 7 23-page articles in each issue. In addition, all Special Issues are published online-only to ensure a timely publication. The transactions consists primarily of research papers. This is an archival journal and it is intended that the papers will have lasting importance and value over time. In general, papers whose primary focus is on particular multimedia products or the current state of the industry will not be included.