Jinfu Liu, Runwei Ding, Yuhang Wen, Nan Dai, Fanyang Meng, Fang-Lue Zhang, Shen Zhao, Mengyuan Liu
CAAI Transactions on Intelligence Technology, Volume 9, Issue 6, pp. 1623-1633. Published 2024-08-16. DOI: 10.1049/cit2.12366. Impact factor: 8.4 (JCR Q1, Computer Science, Artificial Intelligence).
Full text: https://onlinelibrary.wiley.com/doi/10.1049/cit2.12366
Citations: 0
Explore human parsing modality for action recognition
Multimodal action recognition methods have achieved strong results using the pose and RGB modalities. However, skeleton sequences lack appearance information, and RGB images suffer from irrelevant noise, owing to the limitations of each modality. To address this, the authors introduce the human parsing feature map as a novel modality, since it selectively retains effective semantic features of the body parts while filtering out most irrelevant noise. The authors propose a new dual-branch framework called the ensemble human parsing and pose network (EPP-Net), which is the first to leverage both the skeleton and human parsing modalities for action recognition. The human pose branch feeds robust skeletons into a graph convolutional network to model pose features, while the human parsing branch feeds depictive parsing feature maps into convolutional backbones to model parsing features. The two sets of high-level features are combined through a late fusion strategy for better action recognition. Extensive experiments on the NTU RGB+D and NTU RGB+D 120 benchmarks consistently verify the effectiveness of the proposed EPP-Net, which outperforms existing action recognition methods. The code is available at https://github.com/liujf69/EPP-Net-Action.
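The abstract does not say which late fusion rule is used, so the following is only a minimal sketch of one common choice: a weighted sum of the class probabilities produced by each branch. The function names, the `pose_weight` parameter, and the toy logits are all hypothetical and are not taken from the EPP-Net code.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one vector of class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(
    pose_logits: Sequence[float],
    parsing_logits: Sequence[float],
    pose_weight: float = 0.5,
) -> List[float]:
    """Fuse two branches by a weighted sum of their class probabilities.

    Assumption: score-level weighted averaging. The paper may use a
    different fusion rule or different weights.
    """
    if len(pose_logits) != len(parsing_logits):
        raise ValueError("both branches must score the same set of classes")
    if not 0.0 <= pose_weight <= 1.0:
        raise ValueError("pose_weight must lie in [0, 1]")
    p_pose = softmax(pose_logits)
    p_parsing = softmax(parsing_logits)
    return [
        pose_weight * a + (1.0 - pose_weight) * b
        for a, b in zip(p_pose, p_parsing)
    ]


def predict(fused_scores: Sequence[float]) -> int:
    """Return the index of the highest fused class score."""
    return max(range(len(fused_scores)), key=lambda i: fused_scores[i])


# Toy example with three action classes (made-up logits, not real results).
pose_logits = [2.0, 1.0, 0.1]      # hypothetical output of the pose (GCN) branch
parsing_logits = [0.5, 2.5, 0.3]   # hypothetical output of the parsing (CNN) branch
fused = late_fusion(pose_logits, parsing_logits, pose_weight=0.5)
print(predict(fused), [round(s, 3) for s in fused])
```

In this toy case the two branches disagree on the top class, and the fused scores decide between them. Because fusion happens only at the score level, each branch can be trained and evaluated independently before the ensemble step.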
About the journal:
CAAI Transactions on Intelligence Technology is a leading venue for original research on the theoretical and experimental aspects of artificial intelligence technology. We are a fully open access journal co-published by the Institution of Engineering and Technology (IET) and the Chinese Association for Artificial Intelligence (CAAI) providing research which is openly accessible to read and share worldwide.