Modality mixer exploiting complementary information for multi-modal action recognition

IF 4.3 | Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding Pub Date : 2025-04-03 DOI:10.1016/j.cviu.2025.104358

Sumin Lee, Sangmin Woo, Muhammad Adi Nugroho, Changick Kim

Volume 256, Article 104358. Full text: https://www.sciencedirect.com/science/article/pii/S1077314225000815

Citations: 0

Abstract

Because sensors differ in their characteristics, each modality exhibits unique physical properties. For this reason, in multi-modal action recognition it is important to consider not only the overall action content but also the complementary nature of the different modalities. In this paper, we propose a novel network, named the Modality Mixer (M-Mixer) network, which leverages the complementary information across modalities together with the temporal context of actions for action recognition. A key component of the proposed M-Mixer is the Multi-modal Contextualization Unit (MCU), a simple yet effective recurrent unit. The MCU temporally encodes a sequence of one modality (e.g., RGB) with action content features of the other modalities (e.g., depth and infrared). This process encourages the M-Mixer network to exploit global action content and to supplement it with complementary information from the other modalities. Furthermore, to extract complementary information appropriate to the given modality setting, we introduce a new module, named the Complementary Feature Extraction Module (CFEM). CFEM incorporates a separate learnable query embedding for each modality, which guides CFEM to extract complementary information and global action content from the other modalities. As a result, the proposed method outperforms state-of-the-art methods on the NTU RGB+D 60, NTU RGB+D 120, and NW-UCLA datasets. Moreover, comprehensive ablation studies further validate the effectiveness of the proposed method.
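The data flow described in the abstract can be illustrated with a rough NumPy sketch. The abstract does not give the MCU gating equations or the CFEM attention form, so everything below is an assumption made for illustration: `QueryPooling` is a hypothetical stand-in for CFEM (one learnable query per modality, attending over the frames of the other modalities), and `GatedContextCell` is a GRU-style stand-in for the MCU. Weights are random and untrained; only the wiring is the point.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class QueryPooling:
    """Hypothetical CFEM stand-in (assumption, not the paper's module):
    a per-modality learnable query attends over the other modalities' frames."""
    def __init__(self, modalities, dim):
        self.dim = dim
        self.query = {m: rng.normal(scale=0.1, size=dim) for m in modalities}

    def __call__(self, target, feats):
        # Stack the frame features of every modality except the target: (T_others, dim)
        others = np.concatenate([f for m, f in feats.items() if m != target], axis=0)
        weights = softmax(others @ self.query[target] / np.sqrt(self.dim))
        return weights @ others  # attention-pooled context vector, shape (dim,)

class GatedContextCell:
    """Hypothetical MCU stand-in (assumption): a GRU-style cell whose gates
    see the current frame x_t, the hidden state h, and the cross-modal context g."""
    def __init__(self, dim):
        self.Wz = rng.normal(scale=0.1, size=(3 * dim, dim))
        self.Wr = rng.normal(scale=0.1, size=(3 * dim, dim))
        self.Wh = rng.normal(scale=0.1, size=(3 * dim, dim))

    def step(self, x_t, h, g):
        v = np.concatenate([x_t, h, g])
        z = sigmoid(v @ self.Wz)  # update gate
        r = sigmoid(v @ self.Wr)  # reset gate
        cand = np.tanh(np.concatenate([x_t, r * h, g]) @ self.Wh)
        return (1.0 - z) * h + z * cand

    def encode(self, seq, g):
        h = np.zeros(seq.shape[1])
        for x_t in seq:  # iterate over time steps
            h = self.step(x_t, h, g)
        return h

def mix_modalities(feats, cfem, cells):
    # Encode each modality with context pooled from the others, then concatenate.
    return np.concatenate(
        [cells[m].encode(feats[m], cfem(m, feats)) for m in feats]
    )

# Toy input: 8 frames of 16-dim features per modality (random placeholders).
T, D = 8, 16
modalities = ["rgb", "depth", "ir"]
feats = {m: rng.normal(size=(T, D)) for m in modalities}
cfem = QueryPooling(modalities, D)
cells = {m: GatedContextCell(D) for m in modalities}
video_repr = mix_modalities(feats, cfem, cells)
print(video_repr.shape)  # (48,)
```

In a real system the per-frame features would come from modality-specific backbones and the concatenated representation would feed a classifier; both are omitted here because the abstract does not specify them.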


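Source journal

Computer Vision and Image Understanding (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 7.80

Self-citation rate: 4.40%

Articles published: 112

Review time: 79 days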

Journal description: The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis, from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views. Research areas include: theory; early vision; data structures and representations; shape; range; motion; matching and recognition; architecture and languages; vision systems.