Global context-aware attention model for weakly-supervised temporal action localization

Weina Fu, Wenxiang Zhang, Jing Long, Gautam Srivastava, Shuai Liu

Alexandria Engineering Journal, Volume 127, Pages 43-55. Published 2025-05-09. DOI: 10.1016/j.aej.2025.05.006. https://www.sciencedirect.com/science/article/pii/S1110016825006179
Temporal action localization (TAL) is a significant and challenging task in the field of video understanding. It aims to locate the start and end timestamps of actions in a video and recognize their categories. However, effective action localization typically requires extensive, precise annotations. Researchers therefore proposed weakly-supervised temporal action localization (WTAL), which aims to locate action instances using only video-level labels. Existing WTAL methods cannot effectively distinguish action context information, such as pre-action and post-action scenes, which blurs action boundaries and leads to inaccurate localization. To solve these problems, this paper proposes a global context-aware attention model (GCAM). First, GCAM designs a mask attention module (MAM) that restricts the model's receptive field so that it focuses on localized features related to the action context; this sharpens the separation between action and context and yields clearer start and end timestamps. Second, GCAM introduces a context broadcasting module (CBM), which supplements global context information to keep the features intact along the temporal dimension, resolving the over-emphasis on localized features caused by the MAM. Extensive experiments on the THUMOS14 and ActivityNet1.2 datasets demonstrate the effectiveness of GCAM. On THUMOS14, GCAM achieves an average mean average precision (mAP) of 49.5%, a 2.2% improvement over existing WTAL methods; on ActivityNet1.2, it achieves an average mAP of 27.2%, a 0.3% improvement. These results highlight the superior performance of GCAM in accurately localizing actions in videos.
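The paper's own code is not reproduced here, but the two components admit a compact illustration. The following minimal PyTorch sketch shows the ideas as the abstract states them: a band-masked self-attention that restricts each snippet's temporal receptive field (the MAM idea), and a broadcast of the temporal mean feature that reinjects global context (the CBM idea). All class names, the window radius, and the mean-based broadcast are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskAttention(nn.Module):
    """Hypothetical sketch of a mask attention module (MAM): self-attention
    over T snippet features with a band mask, so each snippet only attends
    to neighbors within a local temporal window (assumed radius `window`)."""
    def __init__(self, dim: int, window: int = 5):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.window = window

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, C)
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) / C ** 0.5       # (B, T, T)
        # Band mask: snippet t may only attend to snippets t' with |t - t'| <= window.
        idx = torch.arange(T, device=x.device)
        band = (idx[None, :] - idx[:, None]).abs() <= self.window
        attn = attn.masked_fill(~band, float("-inf"))
        return self.proj(F.softmax(attn, dim=-1) @ v)

class ContextBroadcast(nn.Module):
    """Hypothetical sketch of the context broadcasting module (CBM): add the
    temporal mean feature back to every snippet, so global context survives
    the locally masked attention."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, C)
        return x + x.mean(dim=1, keepdim=True)

# Usage on dummy snippet features: two videos, 64 snippets, 128-dim features.
x = torch.randn(2, 64, 128)
y = ContextBroadcast()(MaskAttention(128)(x))            # (2, 64, 128)
```

The band mask plays the role the abstract assigns to the MAM (a restricted receptive field), while the mean broadcast plays the role of the CBM (restoring global temporal context); the paper's actual modules may differ in detail.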
Journal introduction:
Alexandria Engineering Journal is an international journal devoted to publishing high-quality papers in the fields of engineering and applied science. Alexandria Engineering Journal is cited in the Engineering Information Services (EIS) and the Chemical Abstracts (CA). The papers published in Alexandria Engineering Journal are grouped into five sections, according to the following classification:
• Mechanical, Production, Marine and Textile Engineering
• Electrical Engineering, Computer Science and Nuclear Engineering
• Civil and Architecture Engineering
• Chemical Engineering and Applied Sciences
• Environmental Engineering