{"title":"FocalFormer: Leveraging focal modulation for efficient action segmentation in egocentric videos","authors":"Jialu Xi, Shiguang Liu","doi":"10.1016/j.cag.2025.104381","DOIUrl":null,"url":null,"abstract":"<div><div>With the development of various emerging devices (e.g., AR/VR) and video dissemination technologies, self-centered video tasks have received much attention, and it is especially important to understand user actions in self-centered videos, where self-centered temporal action segmentation complicates the task due to its unique challenges such as abrupt point-of-view shifts and limited field of view. Existing work employs Transformer-based architectures to model long-range dependencies in sequential data. However, these models often struggle to effectively accommodate the nuances of egocentric action segmentation and incur significant computational costs. Therefore, we propose a new framework that integrates focus modulation into the Transformer architecture. Unlike the traditional self-attention mechanism, which focuses uniformly on all features in the entire sequence, focus modulation replaces the self-attention layer with a more focused and efficient mechanism. This design allows for selective aggregation of local features and adaptive integration of global context through content-aware gating, which is critical for capturing detailed local motion (e.g., hand-object interactions) and handling dynamic context changes in first-person video. Our model also adds a context integration module, where focus modulation ensures that only relevant global contexts are integrated based on the content of the current frame, ultimately efficiently decoding aggregated features to produce accurate temporal action boundaries. By using focus modulation, our model achieves a lightweight design that reduces the number of parameters typically associated with Transformer-based models. We validate the effectiveness of our approach on classical datasets for temporal segmentation tasks (50salads, breakfast) as well as additional datasets with a first-person perspective (GTEA, HOI4D, and FineBio).</div></div>","PeriodicalId":50628,"journal":{"name":"Computers & Graphics-Uk","volume":"132 ","pages":"Article 104381"},"PeriodicalIF":2.8000,"publicationDate":"2025-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computers & Graphics-Uk","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0097849325002225","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Abstract
With the development of emerging devices (e.g., AR/VR) and video dissemination technologies, egocentric video tasks have received much attention, and understanding user actions in egocentric videos has become especially important. Egocentric temporal action segmentation is complicated by unique challenges such as abrupt point-of-view shifts and a limited field of view. Existing work employs Transformer-based architectures to model long-range dependencies in sequential data. However, these models often struggle to accommodate the nuances of egocentric action segmentation and incur significant computational costs. Therefore, we propose a new framework that integrates focal modulation into the Transformer architecture. Unlike the traditional self-attention mechanism, which attends uniformly to all features across the entire sequence, focal modulation replaces the self-attention layer with a more focused and efficient mechanism. This design allows selective aggregation of local features and adaptive integration of global context through content-aware gating, which is critical for capturing fine-grained local motion (e.g., hand-object interactions) and handling dynamic context changes in first-person video. Our model also adds a context integration module, in which focal modulation ensures that only global context relevant to the content of the current frame is integrated; the aggregated features are then efficiently decoded to produce accurate temporal action boundaries. By using focal modulation, our model achieves a lightweight design that reduces the number of parameters typically associated with Transformer-based models. We validate the effectiveness of our approach on classical temporal action segmentation datasets (50Salads, Breakfast) as well as egocentric datasets (GTEA, HOI4D, and FineBio).
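To make the mechanism described in the abstract concrete, the following is a minimal sketch of a focal modulation layer adapted to 1-D temporal features, in the spirit of FocalNet-style focal modulation: depthwise temporal convolutions aggregate local context at progressively larger receptive fields, content-aware gates weight each level plus a global summary, and the result modulates a per-frame query in place of self-attention. This is an illustrative PyTorch implementation under stated assumptions (module name, number of focal levels, kernel sizes, and projection layout are hypothetical), not the authors' released code.

```python
import torch
import torch.nn as nn


class FocalModulation1D(nn.Module):
    """Illustrative 1-D focal modulation block for per-frame video features."""

    def __init__(self, dim: int, focal_levels: int = 3, kernel_size: int = 3):
        super().__init__()
        self.focal_levels = focal_levels
        # One projection produces the query, the initial context, and
        # per-level gates (+1 gate for the global context) in a single pass.
        self.proj_in = nn.Linear(dim, 2 * dim + focal_levels + 1)
        # Depthwise temporal convolutions enlarge the receptive field level by level.
        self.focal_convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(dim, dim, kernel_size + 2 * l,
                          padding=(kernel_size + 2 * l) // 2,
                          groups=dim, bias=False),
                nn.GELU(),
            )
            for l in range(focal_levels)
        ])
        self.proj_context = nn.Conv1d(dim, dim, kernel_size=1)
        self.proj_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim)
        dim = x.size(-1)
        q, ctx, gates = torch.split(
            self.proj_in(x), [dim, dim, self.focal_levels + 1], dim=-1)
        ctx = ctx.transpose(1, 2)        # (batch, dim, frames) for Conv1d
        gates = gates.transpose(1, 2)    # (batch, levels + 1, frames)

        # Gated aggregation: each level contributes context weighted by its gate.
        agg = torch.zeros_like(ctx)
        for l, conv in enumerate(self.focal_convs):
            ctx = conv(ctx)                          # progressively wider temporal context
            agg = agg + ctx * gates[:, l:l + 1]
        # Global context: temporal average, gated like the local levels.
        agg = agg + ctx.mean(dim=2, keepdim=True) * gates[:, self.focal_levels:]

        # Modulate the per-frame query with the aggregated context.
        modulator = self.proj_context(agg).transpose(1, 2)   # (batch, frames, dim)
        return self.proj_out(q * modulator)


if __name__ == "__main__":
    # Toy usage: a clip of 128 frames with 256-dim per-frame features.
    frames = torch.randn(2, 128, 256)
    layer = FocalModulation1D(dim=256)
    print(layer(frames).shape)  # torch.Size([2, 128, 256])
```

Compared with self-attention, this kind of block costs roughly O(T) in sequence length rather than O(T^2), which is consistent with the lightweight design the abstract claims; the exact architecture, gating scheme, and context integration module in FocalFormer may differ from this sketch.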
About the journal:
Computers & Graphics is dedicated to disseminate information on research and applications of computer graphics (CG) techniques. The journal encourages articles on:
1. Research and applications of interactive computer graphics. We are particularly interested in novel interaction techniques and applications of CG to problem domains.
2. State-of-the-art papers on late-breaking, cutting-edge research on CG.
3. Information on innovative uses of graphics principles and technologies.
4. Tutorial papers on both teaching CG principles and innovative uses of CG in education.