SAM-Net: Semantic-assisted multimodal network for action recognition in RGB-D videos

Dan Liu, Fanrong Meng, Jinpeng Mi, Mao Ye, Qingdu Li, Jianwei Zhang

Pattern Recognition, Volume 168, Article 111725 (published 2025-05-15)
DOI: 10.1016/j.patcog.2025.111725
Available at: https://www.sciencedirect.com/science/article/pii/S0031320325003851
Citations: 0
Abstract
The advent of affordable depth sensors has driven extensive research on human action recognition (HAR) in RGB-D videos. Existing unimodal approaches, such as skeleton-based or RGB video-based methods, have inherent limitations: the skeleton modality lacks spatial interaction cues, while the RGB video modality is highly susceptible to environmental noise. Additionally, multimodal action recognition often suffers from insufficient data fusion and a substantial computational burden for temporal modeling. In this paper, we present an innovative Semantic-Assisted Multimodal Network (SAM-Net) for HAR in RGB-D videos. First, we generate a SpatioTemporal Dynamic Region (STDR) image to replace the RGB video modality by leveraging the skeleton modality, thereby significantly reducing the video data volume. Next, we exploit semantic information from large-scale vision-language models (VLMs), which effectively facilitates multimodal adaptation learning. Moreover, we implement an intramodal and intermodal multi-level fusion process for HAR. Finally, through extensive testing on three challenging datasets, the proposed SAM-Net shows consistent state-of-the-art performance across various experimental configurations. Our code will be released at https://github.com/2233950316/code.
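To make the fusion idea in the abstract concrete, below is a minimal sketch of semantic-guided intermodal fusion: features from the skeleton branch and the STDR-image branch are weighted by their similarity to a VLM text embedding before being combined. All names, dimensions, and the specific weighting scheme here are illustrative assumptions; the abstract does not specify SAM-Net's actual fusion architecture.

```python
import math
import random

random.seed(0)
D = 8  # illustrative shared embedding size (hypothetical)

def l2norm(v):
    """Scale a vector to unit length (intramodal normalization step)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Stand-in features for the three branches described in the abstract.
skel = l2norm([random.gauss(0, 1) for _ in range(D)])  # skeleton branch
stdr = l2norm([random.gauss(0, 1) for _ in range(D)])  # STDR-image branch
text = l2norm([random.gauss(0, 1) for _ in range(D)])  # VLM semantic embedding

# Intermodal step: weight each visual branch by its cosine similarity
# to the semantic embedding, then fuse with a softmax over the two
# similarities (a common pattern, not necessarily the paper's).
sims = [dot(skel, text), dot(stdr, text)]
exps = [math.exp(s) for s in sims]
weights = [e / sum(exps) for e in exps]

fused = [weights[0] * s + weights[1] * t for s, t in zip(skel, stdr)]
print(len(fused), round(sum(weights), 6))
```

The softmax keeps the two branch weights positive and summing to one, so the semantic embedding modulates, rather than replaces, the visual evidence.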
About the journal
The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.