{"title":"Exploring Fusion Techniques and Explainable AI on Adapt-FuseNet: Context-Adaptive Fusion of Face and Gait for Person Identification","authors":"Thejaswin S;Ashwin Prakash;Athira Nambiar;Alexandre Bernadino","doi":"10.1109/TBIOM.2024.3405081","DOIUrl":null,"url":null,"abstract":"Biometrics such as human gait and face play a significant role in vision-based surveillance applications. However, multimodal fusion of biometric features is a challenging task in non-controlled environments due to varying reliability of the features from different modalities in changing contexts, such as viewpoints, illuminations, occlusion, background clutter, and clothing. For instance, in person identification in the wild, facial and gait features play a complementary role, as, in principle, face provides more discriminatory features than gait if the person is frontal to the camera, while gait features are more discriminative in lateral views. Classical fusion techniques typically address this problem by explicitly computing in which context the data is obtained (e.g., frontal or lateral) and designing custom data fusion strategies for each context. However, this requires an initial enumeration of all the possible contexts and the design of context “detectors”, which bring their own challenges. Hence, how to effectively utilize both facial and gait information in arbitrary conditions is still an open problem. In this paper we present a context-adaptive multi-biometric fusion strategy that does not require the prior determination of context features; instead, the context is implicitly encoded in the fusion process by a set of attentional weights that encode the relevance of the different modalities for each particular data sample. The key contributions of the paper are threefold. First, we propose a novel framework for the dynamic fusion of multiple biometrics modalities leveraging attention techniques, denoted ‘Adapt-FuseNet’. Second, we perform an extensive evaluation of the proposed method in comparison to various other fusion techniques such as Bilinear Pooling, Parallel Co-attention, Keyless Attention, Multi-modal Factorized High-order Pooling, and Multimodal Tucker Fusion. Third, an Explainable Artificial Intelligence-based interpretation tool is used to analyse how the attention mechanism of ‘Adapt-FuseNet’ is capturing context implicitly and making the best weighting of the different modalities for the task at hand. This enables the interpretability of results in a more human-compliant way, hence boosting our confidence of the operation of AI systems in the wild. Extensive experiments are carried out on two public gait datasets (CASIA-A and CASIA-B), showing that ‘Adapt-FuseNet’ significantly outperforms the state-of-the-art.","PeriodicalId":73307,"journal":{"name":"IEEE transactions on biometrics, behavior, and identity science","volume":"6 4","pages":"515-527"},"PeriodicalIF":0.0000,"publicationDate":"2024-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on biometrics, behavior, and identity science","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10540048/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Biometrics such as human gait and face play a significant role in vision-based surveillance applications. However, multimodal fusion of biometric features is a challenging task in non-controlled environments due to the varying reliability of features from different modalities under changing contexts, such as viewpoint, illumination, occlusion, background clutter, and clothing. For instance, in person identification in the wild, facial and gait features play a complementary role: in principle, the face provides more discriminative features than gait when the person is frontal to the camera, while gait features are more discriminative in lateral views. Classical fusion techniques typically address this problem by explicitly computing in which context the data is obtained (e.g., frontal or lateral) and designing custom data fusion strategies for each context. However, this requires an initial enumeration of all possible contexts and the design of context “detectors”, which bring their own challenges. Hence, how to effectively utilize both facial and gait information in arbitrary conditions is still an open problem. In this paper, we present a context-adaptive multi-biometric fusion strategy that does not require the prior determination of context features; instead, the context is implicitly encoded in the fusion process by a set of attentional weights that capture the relevance of the different modalities for each particular data sample. The key contributions of the paper are threefold. First, we propose a novel framework for the dynamic fusion of multiple biometric modalities leveraging attention techniques, denoted ‘Adapt-FuseNet’. Second, we perform an extensive evaluation of the proposed method against various other fusion techniques such as Bilinear Pooling, Parallel Co-attention, Keyless Attention, Multi-modal Factorized High-order Pooling, and Multimodal Tucker Fusion. Third, an Explainable Artificial Intelligence-based interpretation tool is used to analyse how the attention mechanism of ‘Adapt-FuseNet’ implicitly captures context and weights the different modalities appropriately for the task at hand. This makes the results interpretable in a more human-compliant way, hence boosting our confidence in the operation of AI systems in the wild. Extensive experiments are carried out on two public gait datasets (CASIA-A and CASIA-B), showing that ‘Adapt-FuseNet’ significantly outperforms the state-of-the-art.
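The abstract describes an attention-based fusion in which per-sample attentional weights encode the relevance of each modality, so that context (e.g., frontal vs. lateral view) is captured implicitly rather than detected explicitly. The sketch below illustrates that idea in PyTorch; it is a minimal illustration only, and the module structure, feature dimensions, subject count, and softmax gating are assumptions for exposition, not the authors' published Adapt-FuseNet implementation.

import torch
import torch.nn as nn

class AttentiveFusion(nn.Module):
    """Illustrative attention-weighted fusion of face and gait embeddings."""
    def __init__(self, face_dim=512, gait_dim=256, fused_dim=256, num_ids=124):
        super().__init__()
        self.face_proj = nn.Linear(face_dim, fused_dim)   # project face features
        self.gait_proj = nn.Linear(gait_dim, fused_dim)   # project gait features
        self.attn = nn.Linear(2 * fused_dim, 2)           # one relevance score per modality
        self.classifier = nn.Linear(fused_dim, num_ids)   # identity logits

    def forward(self, face_feat, gait_feat):
        f = torch.tanh(self.face_proj(face_feat))
        g = torch.tanh(self.gait_proj(gait_feat))
        # Per-sample weights: because the scores depend on the features themselves,
        # the context is encoded implicitly, with no explicit context detector.
        w = torch.softmax(self.attn(torch.cat([f, g], dim=-1)), dim=-1)
        fused = w[:, 0:1] * f + w[:, 1:2] * g
        return self.classifier(fused), w

# Usage with hypothetical feature dimensions and a batch of 4 samples.
model = AttentiveFusion()
logits, weights = model(torch.randn(4, 512), torch.randn(4, 256))
print(logits.shape, weights.shape)  # torch.Size([4, 124]) torch.Size([4, 2])

Returning the weights alongside the logits also hints at the interpretability angle discussed in the abstract: inspecting them per sample shows which modality the fusion relied on for a given viewpoint or condition.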