TriGait: Hybrid Fusion Strategy for Multimodal Alignment and Integration in Gait Recognition
Yan Sun; Xueling Feng; Xiaolei Liu; Liyan Ma; Long Hu; Mark S. Nixon
IEEE Transactions on Biometrics, Behavior, and Identity Science, vol. 7, no. 1, pp. 82-94
DOI: 10.1109/TBIOM.2024.3435046 | Published online: 2024-07-29
Article page: https://ieeexplore.ieee.org/document/10612818/
Citations: 0
Abstract
Due to the inherent limitations of single modalities, multimodal fusion has become increasingly popular in many computer vision fields, leveraging the complementary advantages of unimodal methods. As an emerging biometric technology with great application potential, gait recognition faces similar challenges. The prevailing silhouette-based and skeleton-based gait recognition methods have their respective limitations: one focuses on appearance information while neglecting structural details, and the other does the opposite. Multimodal gait recognition, which combines silhouette and skeleton, promises more robust predictions. However, it is essential yet difficult to explore the implicit interaction between dense pixels and discrete coordinate points. Most existing multimodal gait recognition methods simply concatenate features from silhouette and skeleton and do not fully exploit the complementarity between them. This paper presents a hybrid fusion strategy called TriGait, a three-branch model that thoroughly explores the interaction and complementarity of the two modalities. To address data heterogeneity and exploit the mutual information between the two modalities, we propose a cross-modal token generator (CMTG) within a fusion branch to align and fuse their low-level features. Additionally, TriGait has two further branches for extracting high-level semantic information from silhouette and skeleton. By combining low-level correlation information and high-level semantic information, TriGait provides a comprehensive and discriminative representation of a subject's gait. Extensive experimental results on CASIA-B, Gait3D and OUMVLP demonstrate the effectiveness of TriGait. Remarkably, TriGait achieves rank-1 mean accuracies of 96.6%, 61.4% and 91.1% on CASIA-B, Gait3D and OUMVLP respectively, outperforming the state-of-the-art methods. The source code is available at https://github.com/YanSun-github/TriGait/.
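For readers who want a concrete picture of the three-branch design described in the abstract, below is a minimal, hypothetical PyTorch sketch. All module and parameter names (CrossModalTokenGenerator, TriGaitSketch, sil_dim, skel_dim, token_dim, embed_dim) and the tensor shapes are illustrative assumptions, not the authors' implementation; the actual code is in the repository linked above.

```python
# Hedged sketch of the three-branch structure described in the abstract.
# Shapes and module names are assumptions for illustration only.
import torch
import torch.nn as nn


class CrossModalTokenGenerator(nn.Module):
    """Projects low-level silhouette and skeleton features into a shared token
    space and fuses them with cross-attention (one possible reading of the CMTG idea)."""

    def __init__(self, sil_dim, skel_dim, token_dim, num_heads=4):
        super().__init__()
        self.sil_proj = nn.Linear(sil_dim, token_dim)    # pixel-derived features -> tokens
        self.skel_proj = nn.Linear(skel_dim, token_dim)  # joint-derived features -> tokens
        self.cross_attn = nn.MultiheadAttention(token_dim, num_heads, batch_first=True)

    def forward(self, sil_feat, skel_feat):
        # sil_feat: [B, N_sil, sil_dim], skel_feat: [B, N_skel, skel_dim]
        sil_tok = self.sil_proj(sil_feat)
        skel_tok = self.skel_proj(skel_feat)
        fused, _ = self.cross_attn(sil_tok, skel_tok, skel_tok)  # silhouette queries skeleton
        return fused.mean(dim=1)  # pooled low-level fusion embedding [B, token_dim]


class TriGaitSketch(nn.Module):
    """Three branches: a low-level fusion branch plus two unimodal high-level
    branches, concatenated into a single gait representation."""

    def __init__(self, sil_dim=128, skel_dim=64, token_dim=128, embed_dim=256):
        super().__init__()
        self.fusion_branch = CrossModalTokenGenerator(sil_dim, skel_dim, token_dim)
        self.sil_branch = nn.Sequential(nn.Linear(sil_dim, embed_dim), nn.ReLU())
        self.skel_branch = nn.Sequential(nn.Linear(skel_dim, embed_dim), nn.ReLU())
        self.head = nn.Linear(token_dim + 2 * embed_dim, embed_dim)

    def forward(self, sil_feat, skel_feat):
        low = self.fusion_branch(sil_feat, skel_feat)        # low-level cross-modal correlation
        high_sil = self.sil_branch(sil_feat).mean(dim=1)     # high-level appearance cues
        high_skel = self.skel_branch(skel_feat).mean(dim=1)  # high-level structural cues
        return self.head(torch.cat([low, high_sil, high_skel], dim=-1))


if __name__ == "__main__":
    model = TriGaitSketch()
    sil = torch.randn(2, 44, 128)   # e.g., 44 silhouette patch tokens per sequence
    skel = torch.randn(2, 17, 64)   # e.g., 17 skeleton joints per sequence
    print(model(sil, skel).shape)   # torch.Size([2, 256])
```

The sketch only mirrors the high-level description: one branch aligning and fusing low-level features across modalities, two branches keeping modality-specific high-level semantics, and a final combination into one discriminative embedding.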