Inter and Intra-snippet Multi-head Attention With Position Offset for Action Localization and Recognition

Himanshu Singh;Khanjan Choudhury;Badri Narayan Subudhi;Vinit Jakhetiya;T. Veerakumar
{"title":"Inter and Intra-snippet Multi-head Attention With Position Offset for Action Localization and Recognition","authors":"Himanshu Singh;Khanjan Choudhury;Badri Narayan Subudhi;Vinit Jakhetiya;T. Veerakumar","doi":"10.1109/TAI.2025.3630621","DOIUrl":null,"url":null,"abstract":"Numerous studies have focused on action localization and recognition; however, their performance suffers when applied to weakly supervised scenarios, leading to poor or rapidly declining results. This article introduces an efficient deep learning architecture based on multi-head attention to enhance action localization in untrimmed videos. Our proposed algorithm comprises three stages. Initially, a short-snippet enhancement (SSE) sampling module captures intrinsic details in video frames, adeptly balancing short-term and long-term action contributions for improved localization. The second stage employs inter-snippet and intra-snippet multi-head attention, incorporating positional offset, to capture spatio-temporal dependencies among videos and within individual video snippets, precisely identifying action boundaries. The third stage integrates an action localization network with uncertainty-guided pseudoinstance-level and video-level losses to enhance performance, mitigating the impact of noisy labels. A multistep updating process progressively refines action proposals, augmenting localization precision. To demonstrate the effectiveness of our proposed scheme, we evaluate the performance of the proposed scheme using mean average precision (mAP) over the different thresholds of intersection over union (IoU) as the evaluation measure on the “THUMOS14” and “ActivityNet-v1.3” datasets. Our algorithm achieves an mAP value of 45.20% on “THUMOS14” and an mAP value of 25.24% on “ActivityNet-v1.3.” Furthermore, we compare our technique with 24 state-of-the-art (SOTA) techniques on “THUMOS14” and eleven SOTA techniques on “ActivityNet-v1.3,” confirming the superiority of the proposed scheme.","PeriodicalId":73305,"journal":{"name":"IEEE transactions on artificial intelligence","volume":"7 5","pages":"3018-3030"},"PeriodicalIF":0.0000,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on artificial intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/11235990/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/11/10 0:00:00","PubModel":"Epub","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Numerous studies have addressed action localization and recognition; however, their performance degrades sharply in weakly supervised scenarios. This article introduces an efficient deep learning architecture based on multi-head attention to enhance action localization in untrimmed videos. The proposed algorithm comprises three stages. In the first stage, a short-snippet enhancement (SSE) sampling module captures intrinsic details in video frames, balancing short-term and long-term action contributions for improved localization. The second stage employs inter-snippet and intra-snippet multi-head attention, incorporating a positional offset, to capture spatio-temporal dependencies across video snippets and within individual snippets, precisely identifying action boundaries. The third stage integrates an action localization network with uncertainty-guided pseudo-instance-level and video-level losses, mitigating the impact of noisy labels. A multistep updating process progressively refines action proposals, improving localization precision. We evaluate the proposed scheme using mean average precision (mAP) at multiple intersection-over-union (IoU) thresholds on the THUMOS14 and ActivityNet-v1.3 datasets, achieving an mAP of 45.20% on THUMOS14 and 25.24% on ActivityNet-v1.3. Furthermore, we compare our technique with 24 state-of-the-art (SOTA) techniques on THUMOS14 and 11 SOTA techniques on ActivityNet-v1.3, confirming the superiority of the proposed scheme.
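As context for the second stage, the sketch below illustrates one common way to combine multi-head self-attention with a learnable relative position offset over a sequence of snippet features. It is a minimal PyTorch illustration under stated assumptions, not the authors' implementation: the module name `SnippetAttention`, the offset parameterization (a learnable per-head bias indexed by relative temporal distance), and all dimensions are hypothetical. The same mechanism can run along the frame axis within a snippet (intra-snippet) or along the snippet axis across the video (inter-snippet).

```python
import torch
import torch.nn as nn

class SnippetAttention(nn.Module):
    """Hedged sketch: multi-head self-attention with a learnable relative
    position offset added to the attention logits. Names, shapes, and the
    offset parameterization are assumptions, not the paper's code."""

    def __init__(self, dim: int = 256, heads: int = 4, max_len: int = 256):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.max_len = heads, max_len
        self.scale = (dim // heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)   # fused Q, K, V projection
        self.proj = nn.Linear(dim, dim)      # output projection
        # One learnable bias per head per relative offset in [-(max_len-1), max_len-1].
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_len - 1, heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape                     # T snippets (or frames), T <= max_len
        qkv = self.qkv(x).reshape(B, T, 3, self.heads, D // self.heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B, heads, T, d_head)
        logits = (q @ k.transpose(-2, -1)) * self.scale       # (B, heads, T, T)
        # Position offset: the bias at (i, j) depends only on the distance i - j.
        idx = torch.arange(T, device=x.device)
        rel = idx[:, None] - idx[None, :] + self.max_len - 1  # (T, T) indices
        logits = logits + self.rel_bias[rel].permute(2, 0, 1) # add per-head bias
        out = logits.softmax(dim=-1) @ v                      # (B, heads, T, d_head)
        return self.proj(out.transpose(1, 2).reshape(B, T, D))

x = torch.randn(2, 64, 256)        # 2 videos, 64 snippet features of size 256
y = SnippetAttention()(x)          # -> (2, 64, 256)
```

Because the bias depends only on the temporal distance i - j rather than on absolute indices, the same parameters apply to videos of different lengths, which is one plausible motivation for an offset formulation on untrimmed video.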
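For reference, the evaluation protocol named in the abstract (mAP averaged over IoU thresholds) is typically computed as sketched below: greedy, score-ordered matching of predicted segments to ground truth, per-class all-point interpolated AP, then averaging over classes and over the threshold grid. This NumPy sketch reflects the common benchmark convention, not the paper's evaluation code; the function names and threshold grid are illustrative (THUMOS14 results are often averaged over IoU 0.1–0.7, ActivityNet-v1.3 over 0.5–0.95).

```python
import numpy as np

def temporal_iou(seg, gt):
    """IoU between two 1-D temporal segments given as (start, end)."""
    inter = max(0.0, min(seg[1], gt[1]) - max(seg[0], gt[0]))
    union = (seg[1] - seg[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def average_precision(preds, gts, thr):
    """AP for one class at one IoU threshold.
    preds: list of (score, start, end); gts: list of (start, end)."""
    preds = sorted(preds, key=lambda p: -p[0])       # best-scored first
    matched = [False] * len(gts)
    tp = np.zeros(len(preds)); fp = np.zeros(len(preds))
    for i, (_, s, e) in enumerate(preds):
        ious = [temporal_iou((s, e), g) for g in gts]
        j = int(np.argmax(ious)) if ious else -1
        if j >= 0 and ious[j] >= thr and not matched[j]:
            tp[i], matched[j] = 1.0, True            # first hit on this GT
        else:
            fp[i] = 1.0                              # duplicate or low-IoU hit
    rec = np.cumsum(tp) / max(len(gts), 1)
    prec = np.cumsum(tp) / np.maximum(np.cumsum(tp) + np.cumsum(fp), 1e-8)
    # All-point interpolated AP: precision envelope over recall.
    mrec = np.concatenate(([0.0], rec, [1.0]))
    mpre = np.concatenate(([0.0], prec, [0.0]))
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    steps = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[steps + 1] - mrec[steps]) * mpre[steps + 1]))

def mean_ap(preds_by_class, gts_by_class, thresholds):
    """Mean of per-class AP, averaged over the IoU threshold grid."""
    return float(np.mean([
        np.mean([average_precision(preds_by_class.get(c, []), g, t)
                 for c, g in gts_by_class.items()])
        for t in thresholds]))

thumos_grid = np.arange(0.1, 0.75, 0.1)   # common THUMOS14 convention
```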