Global context-aware attention model for weakly-supervised temporal action localization

IF 6.2 · CAS Zone 2 (Engineering & Technology) · JCR Q1 (ENGINEERING, MULTIDISCIPLINARY)
Weina Fu, Wenxiang Zhang, Jing Long, Gautam Srivastava, Shuai Liu
DOI: 10.1016/j.aej.2025.05.006 · Alexandria Engineering Journal, Vol. 127, pp. 43-55 · Published 2025-05-09
URL: https://www.sciencedirect.com/science/article/pii/S1110016825006179
Citations: 0

Abstract

Temporal action localization (TAL) is a significant and challenging task in the field of video understanding. It aims to locate the start and end timestamps of actions in a video and to recognize their categories. However, effective action localization typically requires extensive, precise annotations. Researchers have therefore proposed weakly-supervised temporal action localization (WTAL), which locates action instances in a video using only video-level annotations. Existing WTAL methods cannot effectively distinguish action context information, such as pre-action and post-action scenes; this blurs action boundaries and leads to inaccurate localization. To address these problems, this paper proposes a global context-aware attention model (GCAM). First, GCAM designs a mask attention module (MAM) that restricts the model's receptive field, forcing it to focus on localized features related to the action context. This strengthens the model's ability to distinguish action context information and to locate the start and end timestamps of actions precisely. Second, GCAM introduces a context broadcasting module (CBM), which supplements global context information to keep the features intact along the temporal dimension. The CBM counteracts the MAM's tendency to overemphasize localized features. Extensive experiments on the THUMOS14 and ActivityNet1.2 datasets demonstrate the effectiveness of GCAM. On THUMOS14, GCAM achieves an average mean average precision (mAP) of 49.5%, a 2.2% improvement over existing WTAL methods; on ActivityNet1.2, it achieves an average mAP of 27.2%, a 0.3% improvement. These results highlight the superior performance of GCAM in accurately localizing actions in videos.
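The abstract describes the two modules only at a high level, so the sketch below is a hypothetical PyTorch rendering of the general ideas, not the authors' implementation: the MAM is assumed to be snippet-level self-attention with a band mask that limits each timestep's receptive field to a local temporal window (the `window` size is an illustrative parameter), and the CBM is assumed to broadcast the temporal mean of the features back to every timestep, a common way to re-inject global context.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskAttentionModule(nn.Module):
    """Hypothetical MAM sketch: self-attention over video snippet features
    with a band mask, so each timestep attends only to a local window."""
    def __init__(self, dim: int, window: int = 5):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.window = window  # illustrative; the paper does not specify this

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / D ** 0.5       # (B, T, T)
        idx = torch.arange(T, device=x.device)
        band = (idx[None, :] - idx[:, None]).abs() <= self.window
        scores = scores.masked_fill(~band, float("-inf"))  # restrict receptive field
        return F.softmax(scores, dim=-1) @ v

class ContextBroadcastingModule(nn.Module):
    """Hypothetical CBM sketch: re-inject global context by adding the
    temporal mean of the features to every timestep."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        return x + x.mean(dim=1, keepdim=True)

# Toy usage: 2 videos, 64 snippets, 256-dimensional features per snippet.
feats = torch.randn(2, 64, 256)
local = MaskAttentionModule(dim=256)(feats)   # boundary-sensitive local features
out = ContextBroadcastingModule()(local)      # global context restored
```

Under these assumptions, the band mask sharpens boundary cues by keeping attention local, while the broadcast mean restores the global temporal context that the mask would otherwise discard.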
Source journal

Alexandria Engineering Journal (General Engineering)
CiteScore: 11.20
Self-citation rate: 4.40%
Annual article count: 1015
Review time: 43 days

Journal introduction: Alexandria Engineering Journal is an international journal devoted to publishing high-quality papers in the field of engineering and applied science. Alexandria Engineering Journal is cited in the Engineering Information Services (EIS) and the Chemical Abstracts (CA). The papers published in Alexandria Engineering Journal are grouped into five sections, according to the following classification:
• Mechanical, Production, Marine and Textile Engineering
• Electrical Engineering, Computer Science and Nuclear Engineering
• Civil and Architecture Engineering
• Chemical Engineering and Applied Sciences
• Environmental Engineering