Domain generalization using action sequences for egocentric action recognition

IF 3.3 · CAS Zone 3 (Computer Science) · JCR Q2 (Computer Science, Artificial Intelligence)
Amirshayan Nasirimajd, Chiara Plizzari, Simone Alberto Peirone, Marco Ciccone, Giuseppe Averta, Barbara Caputo
{"title":"Domain generalization using action sequences for egocentric action recognition","authors":"Amirshayan Nasirimajd,&nbsp;Chiara Plizzari,&nbsp;Simone Alberto Peirone,&nbsp;Marco Ciccone,&nbsp;Giuseppe Averta,&nbsp;Barbara Caputo","doi":"10.1016/j.patrec.2025.06.010","DOIUrl":null,"url":null,"abstract":"<div><div>Recognizing human activities from visual inputs, particularly through a first-person viewpoint, is essential for enabling robots to replicate human behavior. Egocentric vision, characterized by cameras worn by observers, captures diverse changes in illumination, viewpoint, and environment. This variability leads to a notable drop in the performance of Egocentric Action Recognition models when tested in environments not seen during training. In this paper, we tackle these challenges by proposing a domain generalization approach for Egocentric Action Recognition. Our insight is that action sequences often reflect consistent user intent across visual domains. By leveraging <em>action sequences</em>, we aim to enhance the model’s generalization ability across unseen environments. Our proposed method, named SeqDG, introduces a visual-text sequence reconstruction objective (SeqRec) that uses contextual cues from both text and visual inputs to reconstruct the central action of the sequence. Additionally, we enhance the model’s robustness by training it on mixed sequences of actions from different domains (SeqMix). We validate SeqDG on the EGTEA and EPIC-KITCHENS-100 datasets. Results on EPIC-KITCHENS-100, show that SeqDG leads to +2.4% relative average improvement in cross-domain action recognition in unseen environments, and on EGTEA the model achieved +0.6% Top-1 accuracy over SOTA in intra-domain action recognition. Code and Data: <span><span>Github.com/Ashayan97/SeqDG</span><svg><path></path></svg></span></div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"196 ","pages":"Pages 213-220"},"PeriodicalIF":3.3000,"publicationDate":"2025-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition Letters","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167865525002387","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Recognizing human activities from visual inputs, particularly through a first-person viewpoint, is essential for enabling robots to replicate human behavior. Egocentric vision, characterized by cameras worn by observers, captures diverse changes in illumination, viewpoint, and environment. This variability leads to a notable drop in the performance of Egocentric Action Recognition models when tested in environments not seen during training. In this paper, we tackle these challenges by proposing a domain generalization approach for Egocentric Action Recognition. Our insight is that action sequences often reflect consistent user intent across visual domains. By leveraging action sequences, we aim to enhance the model’s generalization ability across unseen environments. Our proposed method, named SeqDG, introduces a visual-text sequence reconstruction objective (SeqRec) that uses contextual cues from both text and visual inputs to reconstruct the central action of the sequence. Additionally, we enhance the model’s robustness by training it on mixed sequences of actions from different domains (SeqMix). We validate SeqDG on the EGTEA and EPIC-KITCHENS-100 datasets. Results on EPIC-KITCHENS-100 show that SeqDG yields a +2.4% relative average improvement in cross-domain action recognition in unseen environments, and on EGTEA the model achieves +0.6% Top-1 accuracy over the state of the art in intra-domain action recognition. Code and Data: Github.com/Ashayan97/SeqDG
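The abstract names two training-time components: SeqRec, which reconstructs the central action of a sequence from surrounding visual and textual context, and SeqMix, which mixes action sequences across domains. The following is a minimal sketch of how such objectives could be wired together in PyTorch. It is not the authors' implementation: the tensor shapes, the small transformer encoder over pre-extracted clip features, the learnable mask token, the per-position label-matched swap in `seqmix`, and all default values are illustrative assumptions.

```python
import random
import torch
import torch.nn as nn


class SeqDGSketch(nn.Module):
    """Toy sequence model: a transformer encoder over a window of clip
    features, with a classification head for the central action and a
    reconstruction head used as an auxiliary SeqRec-style objective."""

    def __init__(self, feat_dim=1024, num_classes=97, seq_len=5):
        super().__init__()
        self.center = seq_len // 2
        # Learnable token that stands in for the (hidden) central action.
        self.mask_token = nn.Parameter(torch.zeros(feat_dim))
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.cls_head = nn.Linear(feat_dim, num_classes)  # class count is illustrative
        self.rec_head = nn.Linear(feat_dim, feat_dim)

    def forward(self, seq_feats):
        # seq_feats: (batch, seq_len, feat_dim) pre-extracted clip features.
        masked = seq_feats.clone()
        masked[:, self.center] = self.mask_token      # hide the central action
        ctx = self.encoder(masked)
        logits = self.cls_head(ctx[:, self.center])   # recognize it from context
        recon = self.rec_head(ctx[:, self.center])    # reconstruct its features
        return logits, recon


def seqmix(seq_a, labels_a, seq_b, labels_b, p=0.5):
    """SeqMix-style augmentation sketch: with probability p, replace an action
    in a sequence from domain A with a same-label action from domain B, so the
    model sees action context assembled from different domains (the
    label-matching rule here is an assumption)."""
    mixed = seq_a.clone()
    for t in range(seq_a.shape[0]):
        if labels_a[t] == labels_b[t] and random.random() < p:
            mixed[t] = seq_b[t]
    return mixed, labels_a
```

In a training loop, the two terms would typically be combined, e.g. `loss = F.cross_entropy(logits, y_center) + lam * F.mse_loss(recon, seq_feats[:, model.center])`, where `lam` is a hypothetical weighting hyperparameter; the paper's actual loss formulation may differ.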
Source journal

Pattern Recognition Letters (Engineering & Technology – Computer Science: Artificial Intelligence)

CiteScore: 12.40
Self-citation rate: 5.90%
Articles per year: 287
Review time: 9.1 months

Journal description: Pattern Recognition Letters aims at rapid publication of concise articles of a broad interest in pattern recognition. Subject areas include all the current fields of interest represented by the Technical Committees of the International Association of Pattern Recognition, and other developing themes involving learning and recognition.