{"title":"FocalFormer: Leveraging focal modulation for efficient action segmentation in egocentric videos","authors":"Jialu Xi, Shiguang Liu","doi":"10.1016/j.cag.2025.104381","DOIUrl":null,"url":null,"abstract":"<div><div>With the development of various emerging devices (e.g., AR/VR) and video dissemination technologies, self-centered video tasks have received much attention, and it is especially important to understand user actions in self-centered videos, where self-centered temporal action segmentation complicates the task due to its unique challenges such as abrupt point-of-view shifts and limited field of view. Existing work employs Transformer-based architectures to model long-range dependencies in sequential data. However, these models often struggle to effectively accommodate the nuances of egocentric action segmentation and incur significant computational costs. Therefore, we propose a new framework that integrates focus modulation into the Transformer architecture. Unlike the traditional self-attention mechanism, which focuses uniformly on all features in the entire sequence, focus modulation replaces the self-attention layer with a more focused and efficient mechanism. This design allows for selective aggregation of local features and adaptive integration of global context through content-aware gating, which is critical for capturing detailed local motion (e.g., hand-object interactions) and handling dynamic context changes in first-person video. Our model also adds a context integration module, where focus modulation ensures that only relevant global contexts are integrated based on the content of the current frame, ultimately efficiently decoding aggregated features to produce accurate temporal action boundaries. By using focus modulation, our model achieves a lightweight design that reduces the number of parameters typically associated with Transformer-based models. We validate the effectiveness of our approach on classical datasets for temporal segmentation tasks (50salads, breakfast) as well as additional datasets with a first-person perspective (GTEA, HOI4D, and FineBio).</div></div>","PeriodicalId":50628,"journal":{"name":"Computers & Graphics-Uk","volume":"132 ","pages":"Article 104381"},"PeriodicalIF":2.8000,"publicationDate":"2025-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computers & Graphics-Uk","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0097849325002225","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Abstract
With the development of emerging devices (e.g., AR/VR) and video dissemination technologies, egocentric video tasks have received much attention, and understanding user actions in egocentric videos has become especially important. Egocentric temporal action segmentation is complicated by unique challenges such as abrupt point-of-view shifts and a limited field of view. Existing work employs Transformer-based architectures to model long-range dependencies in sequential data. However, these models often struggle to accommodate the nuances of egocentric action segmentation and incur significant computational costs. Therefore, we propose a new framework that integrates focal modulation into the Transformer architecture. Unlike the traditional self-attention mechanism, which attends uniformly to all features across the entire sequence, focal modulation replaces the self-attention layer with a more focused and efficient mechanism. This design allows selective aggregation of local features and adaptive integration of global context through content-aware gating, which is critical for capturing fine-grained local motion (e.g., hand-object interactions) and handling dynamic context changes in first-person video. Our model also adds a context integration module, in which focal modulation ensures that only global context relevant to the content of the current frame is integrated; the aggregated features are then efficiently decoded to produce accurate temporal action boundaries. By using focal modulation, our model achieves a lightweight design that reduces the number of parameters typically associated with Transformer-based models. We validate the effectiveness of our approach on classical temporal action segmentation datasets (50Salads, Breakfast) as well as egocentric datasets (GTEA, HOI4D, and FineBio).
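To make the mechanism described in the abstract concrete, the following is a minimal sketch of a focal modulation layer adapted to 1-D temporal features, in the spirit of FocalNet-style focal modulation: depthwise temporal convolutions aggregate local context at progressively larger receptive fields, content-aware gates weight each level plus a global summary, and the result modulates a per-frame query in place of self-attention. This is an illustrative PyTorch implementation under stated assumptions (module name, number of focal levels, kernel sizes, and projection layout are hypothetical), not the authors' released code.

```python
import torch
import torch.nn as nn


class FocalModulation1D(nn.Module):
    """Illustrative 1-D focal modulation block for per-frame video features."""

    def __init__(self, dim: int, focal_levels: int = 3, kernel_size: int = 3):
        super().__init__()
        self.focal_levels = focal_levels
        # One projection produces the query, the initial context, and
        # per-level gates (+1 gate for the global context) in a single pass.
        self.proj_in = nn.Linear(dim, 2 * dim + focal_levels + 1)
        # Depthwise temporal convolutions enlarge the receptive field level by level.
        self.focal_convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(dim, dim, kernel_size + 2 * l,
                          padding=(kernel_size + 2 * l) // 2,
                          groups=dim, bias=False),
                nn.GELU(),
            )
            for l in range(focal_levels)
        ])
        self.proj_context = nn.Conv1d(dim, dim, kernel_size=1)
        self.proj_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim)
        dim = x.size(-1)
        q, ctx, gates = torch.split(
            self.proj_in(x), [dim, dim, self.focal_levels + 1], dim=-1)
        ctx = ctx.transpose(1, 2)        # (batch, dim, frames) for Conv1d
        gates = gates.transpose(1, 2)    # (batch, levels + 1, frames)

        # Gated aggregation: each level contributes context weighted by its gate.
        agg = torch.zeros_like(ctx)
        for l, conv in enumerate(self.focal_convs):
            ctx = conv(ctx)                          # progressively wider temporal context
            agg = agg + ctx * gates[:, l:l + 1]
        # Global context: temporal average, gated like the local levels.
        agg = agg + ctx.mean(dim=2, keepdim=True) * gates[:, self.focal_levels:]

        # Modulate the per-frame query with the aggregated context.
        modulator = self.proj_context(agg).transpose(1, 2)   # (batch, frames, dim)
        return self.proj_out(q * modulator)


if __name__ == "__main__":
    # Toy usage: a clip of 128 frames with 256-dim per-frame features.
    frames = torch.randn(2, 128, 256)
    layer = FocalModulation1D(dim=256)
    print(layer(frames).shape)  # torch.Size([2, 128, 256])
```

Compared with self-attention, this kind of block costs roughly O(T) in sequence length rather than O(T^2), which is consistent with the lightweight design the abstract claims; the exact architecture, gating scheme, and context integration module in FocalFormer may differ from this sketch.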
About the journal:
Computers & Graphics is dedicated to disseminate information on research and applications of computer graphics (CG) techniques. The journal encourages articles on:
1. Research and applications of interactive computer graphics. We are particularly interested in novel interaction techniques and applications of CG to problem domains.
2. State-of-the-art papers on late-breaking, cutting-edge research on CG.
3. Information on innovative uses of graphics principles and technologies.
4. Tutorial papers on both teaching CG principles and innovative uses of CG in education.