Weakly-supervised temporal action localization using multi-branch attention weighting

IF 3.1 3区计算机科学 Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Multimedia Systems Pub Date : 2024-08-30 DOI:10.1007/s00530-024-01445-2

Mengxue Liu, Wenjing Li, Fangzhen Ge, Xiangjun Gao

{"title":"Weakly-supervised temporal action localization using multi-branch attention weighting","authors":"Mengxue Liu, Wenjing Li, Fangzhen Ge, Xiangjun Gao","doi":"10.1007/s00530-024-01445-2","DOIUrl":null,"url":null,"abstract":"<p>Weakly-supervised temporal action localization aims to train an accurate and robust localization model using only video-level labels. Due to the lack of frame-level temporal annotations, existing weakly-supervised temporal action localization methods typically rely on multiple instance learning mechanisms to localize and classify all action instances in an untrimmed video. However, these methods focus only on the most discriminative regions that contribute to the classification task, neglecting a large number of ambiguous background and context snippets in the video. We believe that these controversial snippets have a significant impact on the localization results. To mitigate this issue, we propose a multi-branch attention weighting network (MAW-Net), which introduces an additional non-action class and integrates a multi-branch attention module to generate action and background attention, respectively. In addition, considering the correlation among context, action, and background, we use the difference of action and background attention to construct context attention. Finally, based on these three types of attention values, we obtain three new class activation sequences that distinguish action, background, and context. This enables our model to effectively remove background and context snippets in the localization results. Extensive experiments were performed on the THUMOS-14 and Activitynet1.3 datasets. The experimental results show that our method is superior to other state-of-the-art methods, and its performance is comparable to those of fully-supervised approaches.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"62 1","pages":""},"PeriodicalIF":3.1000,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Multimedia Systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s00530-024-01445-2","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

Abstract

Weakly-supervised temporal action localization aims to train an accurate and robust localization model using only video-level labels. Due to the lack of frame-level temporal annotations, existing weakly-supervised temporal action localization methods typically rely on multiple instance learning mechanisms to localize and classify all action instances in an untrimmed video. However, these methods focus only on the most discriminative regions that contribute to the classification task, neglecting a large number of ambiguous background and context snippets in the video. We believe that these controversial snippets have a significant impact on the localization results. To mitigate this issue, we propose a multi-branch attention weighting network (MAW-Net), which introduces an additional non-action class and integrates a multi-branch attention module to generate action and background attention, respectively. In addition, considering the correlation among context, action, and background, we use the difference of action and background attention to construct context attention. Finally, based on these three types of attention values, we obtain three new class activation sequences that distinguish action, background, and context. This enables our model to effectively remove background and context snippets in the localization results. Extensive experiments were performed on the THUMOS-14 and Activitynet1.3 datasets. The experimental results show that our method is superior to other state-of-the-art methods, and its performance is comparable to those of fully-supervised approaches.

Abstract Image

查看原文本刊更多论文

利用多分支注意力加权进行弱监督时间动作定位

弱监督时态动作定位旨在仅使用视频级标签来训练一个准确而稳健的定位模型。由于缺乏帧级时空注释，现有的弱监督时空动作定位方法通常依赖多实例学习机制来定位和分类未剪辑视频中的所有动作实例。然而，这些方法只关注对分类任务最有帮助的区域，而忽略了视频中大量模糊的背景和上下文片段。我们认为，这些有争议的片段会对定位结果产生重大影响。为了缓解这一问题，我们提出了一种多分支注意力加权网络（MAW-Net），它引入了一个额外的非动作类，并集成了一个多分支注意力模块，以分别产生动作和背景注意力。此外，考虑到上下文、动作和背景之间的相关性，我们利用动作注意和背景注意的差异来构建上下文注意。最后，基于这三种注意力值，我们得到了三种新的类别激活序列，它们可以区分动作、背景和上下文。这样，我们的模型就能有效去除定位结果中的背景和上下文片段。我们在 THUMOS-14 和 Activitynet1.3 数据集上进行了广泛的实验。实验结果表明，我们的方法优于其他最先进的方法，其性能可与完全监督方法相媲美。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Multimedia Systems 工程技术-计算机：理论方法

CiteScore

5.40

自引率

7.70%

发文量

148

审稿时长

4.5 months

期刊介绍： This journal details innovative research ideas, emerging technologies, state-of-the-art methods and tools in all aspects of multimedia computing, communication, storage, and applications. It features theoretical, experimental, and survey articles.