{"title":"Inter and Intra-snippet Multi-head Attention With Position Offset for Action Localization and Recognition","authors":"Himanshu Singh;Khanjan Choudhury;Badri Narayan Subudhi;Vinit Jakhetiya;T. Veerakumar","doi":"10.1109/TAI.2025.3630621","DOIUrl":null,"url":null,"abstract":"Numerous studies have focused on action localization and recognition; however, their performance suffers when applied to weakly supervised scenarios, leading to poor or rapidly declining results. This article introduces an efficient deep learning architecture based on multi-head attention to enhance action localization in untrimmed videos. Our proposed algorithm comprises three stages. Initially, a short-snippet enhancement (SSE) sampling module captures intrinsic details in video frames, adeptly balancing short-term and long-term action contributions for improved localization. The second stage employs inter-snippet and intra-snippet multi-head attention, incorporating positional offset, to capture spatio-temporal dependencies among videos and within individual video snippets, precisely identifying action boundaries. The third stage integrates an action localization network with uncertainty-guided pseudoinstance-level and video-level losses to enhance performance, mitigating the impact of noisy labels. A multistep updating process progressively refines action proposals, augmenting localization precision. To demonstrate the effectiveness of our proposed scheme, we evaluate the performance of the proposed scheme using mean average precision (mAP) over the different thresholds of intersection over union (IoU) as the evaluation measure on the “THUMOS14” and “ActivityNet-v1.3” datasets. Our algorithm achieves an mAP value of 45.20% on “THUMOS14” and an mAP value of 25.24% on “ActivityNet-v1.3.” Furthermore, we compare our technique with 24 state-of-the-art (SOTA) techniques on “THUMOS14” and eleven SOTA techniques on “ActivityNet-v1.3,” confirming the superiority of the proposed scheme.","PeriodicalId":73305,"journal":{"name":"IEEE transactions on artificial intelligence","volume":"7 5","pages":"3018-3030"},"PeriodicalIF":0.0000,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on artificial intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/11235990/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/11/10 0:00:00","PubModel":"Epub","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Numerous studies have addressed action localization and recognition; however, their performance degrades sharply in weakly supervised scenarios. This article introduces an efficient deep learning architecture based on multi-head attention to enhance action localization in untrimmed videos. The proposed algorithm comprises three stages. First, a short-snippet enhancement (SSE) sampling module captures intrinsic details in video frames, balancing short-term and long-term action contributions for improved localization. Second, inter-snippet and intra-snippet multi-head attention with a positional offset captures spatio-temporal dependencies across video snippets and within individual snippets, precisely identifying action boundaries. Third, an action localization network is integrated with uncertainty-guided pseudo-instance-level and video-level losses to mitigate the impact of noisy labels, and a multistep updating process progressively refines action proposals, improving localization precision. To demonstrate the effectiveness of the proposed scheme, we evaluate it using mean average precision (mAP) at multiple intersection-over-union (IoU) thresholds on the THUMOS14 and ActivityNet-v1.3 datasets, achieving mAP values of 45.20% and 25.24%, respectively. Furthermore, we compare our technique with 24 state-of-the-art (SOTA) techniques on THUMOS14 and 11 SOTA techniques on ActivityNet-v1.3, confirming the superiority of the proposed scheme.
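The abstract describes the second stage only at a high level. As an illustration of the general mechanism it names, the following is a minimal PyTorch sketch of multi-head self-attention with a learned additive position-offset bias, applied within each snippet (intra) and across pooled snippet-level features (inter). All module names, dimensions, and the exact form of the offset bias are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only; dimensions, names, and the offset-bias form are
# assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn

class OffsetMultiHeadAttention(nn.Module):
    """Multi-head self-attention plus a learned bias indexed by the
    relative position offset between query and key (hypothetical form)."""
    def __init__(self, dim=512, heads=8, max_len=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # One learned scalar per relative offset in [-(max_len-1), max_len-1].
        self.offset_bias = nn.Parameter(torch.zeros(2 * max_len - 1))
        self.max_len = max_len

    def forward(self, x):                                  # x: (B, T, D)
        t = x.size(1)
        idx = torch.arange(t, device=x.device)
        rel = idx[None, :] - idx[:, None] + self.max_len - 1
        bias = self.offset_bias[rel]                       # (T, T), added to logits
        out, _ = self.attn(x, x, x, attn_mask=bias)        # float mask = additive bias
        return out

class InterIntraAttention(nn.Module):
    """Intra-snippet attention among frames of each snippet, then
    inter-snippet attention among mean-pooled snippet features."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.intra = OffsetMultiHeadAttention(dim, heads)
        self.inter = OffsetMultiHeadAttention(dim, heads)

    def forward(self, feats):                              # (B, S, F, D)
        b, s, f, d = feats.shape
        x = self.intra(feats.reshape(b * s, f, d)).reshape(b, s, f, d)
        g = self.inter(x.mean(dim=2))                      # (B, S, D)
        return x + g.unsqueeze(2)                          # broadcast inter context

# Usage: 2 videos, 16 snippets of 8 frames, 512-dim features.
y = InterIntraAttention()(torch.randn(2, 16, 8, 512))      # (2, 16, 8, 512)
```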
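The reported mAP is the standard temporal action detection metric. As a reference for how it is typically computed (this is not the official THUMOS14 or ActivityNet evaluation code, which differs in details such as the threshold grid and interpolation), here is a minimal sketch of per-class average precision at one IoU threshold with greedy, confidence-ranked matching:

```python
# Simplified temporal-detection AP at a single IoU threshold.
import numpy as np

def temporal_iou(pred, gt):
    """IoU of two 1-D segments given as (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def average_precision(preds, gts, thresh):
    """preds: list of (start, end, score); gts: list of (start, end)."""
    preds = sorted(preds, key=lambda p: -p[2])   # rank by confidence
    matched = set()
    tp = np.zeros(len(preds))
    for i, (s, e, _) in enumerate(preds):
        # Greedily match against the best unmatched ground truth.
        ious = [(j, temporal_iou((s, e), g)) for j, g in enumerate(gts)
                if j not in matched]
        if ious:
            j, best = max(ious, key=lambda t: t[1])
            if best >= thresh:
                tp[i] = 1
                matched.add(j)
    precision = np.cumsum(tp) / (np.arange(len(preds)) + 1)
    # AP = precision at each true-positive rank, averaged over all GTs.
    return float(np.sum(precision * tp) / max(len(gts), 1))

preds = [(1.0, 3.0, 0.9), (5.0, 7.0, 0.6)]
gts = [(1.2, 2.8)]
print(average_precision(preds, gts, thresh=0.5))  # 1.0: top detection matches
```

The headline numbers are then means over action classes and, for THUMOS14, commonly also over a grid of IoU thresholds (e.g., 0.1 to 0.7 in steps of 0.1; the exact grid used in the paper is not stated in this abstract).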