Wen Wang, Ling Zhong, Guang Gao, Minhong Wan, J. Gu
{"title":"视频中时间语言基础的跨模态混合注意网络","authors":"Wen Wang, Ling Zhong, Guang Gao, Minhong Wan, J. Gu","doi":"10.1109/ICME55011.2023.00259","DOIUrl":null,"url":null,"abstract":"The goal of temporal language grounding (TLG) task is to temporally localize the most semantically matched video segment with respect to a given sentence query in an untrimmed video. How to effectively incorporate the cross-modal interactions between video and language is the key to improve grounding performance. Previous approaches focus on learning correlations by computing the attention matrix between each frame-word pair, while ignoring the global semantics conditioned on one modality for better associating the complex video contents and sentence query of the target modality. In this paper, we propose a novel Cross-modal Hybrid Attention Network, which integrates two parallel attention fusion modules to exploit the semantics of each modality and interactions in cross modalities. One is Intra-Modal Attention Fusion, which utilizes gated self-attention to capture the frame-by-frame and word-by-word relations conditioned on the other modality. The other is Inter-Modal Attention Fusion, which utilizes query and key features derived from different modalities to calculate the co-attention weights and further promote inter-modal fusion. Experimental results show that our CHAN significantly outperforms several existing state-of-the-arts on three challenging datasets (ActivityNet Captions, Charades-STA and TACOS), demonstrating the effectiveness of our proposed method.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"CHAN: Cross-Modal Hybrid Attention Network for Temporal Language Grounding in Videos\",\"authors\":\"Wen Wang, Ling Zhong, Guang Gao, Minhong Wan, J. Gu\",\"doi\":\"10.1109/ICME55011.2023.00259\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The goal of temporal language grounding (TLG) task is to temporally localize the most semantically matched video segment with respect to a given sentence query in an untrimmed video. How to effectively incorporate the cross-modal interactions between video and language is the key to improve grounding performance. Previous approaches focus on learning correlations by computing the attention matrix between each frame-word pair, while ignoring the global semantics conditioned on one modality for better associating the complex video contents and sentence query of the target modality. In this paper, we propose a novel Cross-modal Hybrid Attention Network, which integrates two parallel attention fusion modules to exploit the semantics of each modality and interactions in cross modalities. One is Intra-Modal Attention Fusion, which utilizes gated self-attention to capture the frame-by-frame and word-by-word relations conditioned on the other modality. The other is Inter-Modal Attention Fusion, which utilizes query and key features derived from different modalities to calculate the co-attention weights and further promote inter-modal fusion. 
Experimental results show that our CHAN significantly outperforms several existing state-of-the-arts on three challenging datasets (ActivityNet Captions, Charades-STA and TACOS), demonstrating the effectiveness of our proposed method.\",\"PeriodicalId\":321830,\"journal\":{\"name\":\"2023 IEEE International Conference on Multimedia and Expo (ICME)\",\"volume\":\"26 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 IEEE International Conference on Multimedia and Expo (ICME)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICME55011.2023.00259\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE International Conference on Multimedia and Expo (ICME)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICME55011.2023.00259","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
CHAN: Cross-Modal Hybrid Attention Network for Temporal Language Grounding in Videos
The goal of the temporal language grounding (TLG) task is to temporally localize the segment of an untrimmed video that best matches a given sentence query semantically. Effectively modeling the cross-modal interactions between video and language is the key to improving grounding performance. Previous approaches focus on learning correlations by computing an attention matrix over every frame-word pair, while ignoring the global semantics of one modality that could better associate the complex video contents with the sentence query of the target modality. In this paper, we propose a novel Cross-modal Hybrid Attention Network (CHAN), which integrates two parallel attention fusion modules to exploit both the semantics within each modality and the interactions across modalities. The first is Intra-Modal Attention Fusion, which uses gated self-attention to capture frame-by-frame and word-by-word relations conditioned on the other modality. The second is Inter-Modal Attention Fusion, which uses query and key features derived from different modalities to compute co-attention weights and further promote inter-modal fusion. Experimental results show that CHAN significantly outperforms several state-of-the-art methods on three challenging datasets (ActivityNet Captions, Charades-STA, and TACoS), demonstrating the effectiveness of the proposed approach.
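To make the two attention fusion modules more concrete, below is a minimal, hypothetical PyTorch sketch of how such intra-modal (gated self-attention conditioned on the other modality) and inter-modal (co-attention with queries and keys from different modalities) fusion could be implemented. This is not the authors' code; the shapes, gating design, and layer choices are illustrative assumptions only.

```python
# Hypothetical sketch of the two attention fusion modules described in the
# abstract; all design details beyond the abstract are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class IntraModalAttentionFusion(nn.Module):
    """Gated self-attention over one modality, conditioned on the other."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate computed from the target features and a pooled summary of the
        # conditioning modality (assumed design).
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, target: torch.Tensor, condition: torch.Tensor) -> torch.Tensor:
        # target:    (B, N, D) frame or word features
        # condition: (B, M, D) features of the other modality
        attn_out, _ = self.self_attn(target, target, target)
        cond_summary = condition.mean(dim=1, keepdim=True).expand_as(target)
        g = torch.sigmoid(self.gate(torch.cat([target, cond_summary], dim=-1)))
        return target + g * attn_out  # gated residual fusion


class InterModalAttentionFusion(nn.Module):
    """Co-attention: queries from one modality, keys/values from the other."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, query_feats: torch.Tensor, key_feats: torch.Tensor) -> torch.Tensor:
        # query_feats: (B, N, D), key_feats: (B, M, D)
        q = self.q_proj(query_feats)
        k = self.k_proj(key_feats)
        v = self.v_proj(key_feats)
        scores = torch.matmul(q, k.transpose(-2, -1)) / (q.size(-1) ** 0.5)
        co_attn = F.softmax(scores, dim=-1)            # co-attention weights
        return query_feats + torch.matmul(co_attn, v)  # fuse attended features


if __name__ == "__main__":
    video = torch.randn(2, 64, 256)  # (batch, frames, dim)
    words = torch.randn(2, 20, 256)  # (batch, words, dim)
    intra = IntraModalAttentionFusion(256)
    inter = InterModalAttentionFusion(256)
    video_intra = intra(video, words)        # frame-by-frame relations, conditioned on language
    video_inter = inter(video_intra, words)  # video queries attend to word keys/values
    print(video_inter.shape)                 # torch.Size([2, 64, 256])
```

In this sketch the two modules run in parallel branches over the video and language streams; how CHAN combines their outputs for segment localization is described in the full paper and is not reproduced here.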