Exploring Fusion Techniques and Explainable AI on Adapt-FuseNet: Context-Adaptive Fusion of Face and Gait for Person Identification

Thejaswin S;Ashwin Prakash;Athira Nambiar;Alexandre Bernadino
{"title":"在 Adapt-FuseNet 上探索融合技术和可解释人工智能:上下文自适应融合人脸和步态以进行人员识别","authors":"Thejaswin S;Ashwin Prakash;Athira Nambiar;Alexandre Bernadino","doi":"10.1109/TBIOM.2024.3405081","DOIUrl":null,"url":null,"abstract":"Biometrics such as human gait and face play a significant role in vision-based surveillance applications. However, multimodal fusion of biometric features is a challenging task in non-controlled environments due to varying reliability of the features from different modalities in changing contexts, such as viewpoints, illuminations, occlusion, background clutter, and clothing. For instance, in person identification in the wild, facial and gait features play a complementary role, as, in principle, face provides more discriminatory features than gait if the person is frontal to the camera, while gait features are more discriminative in lateral views. Classical fusion techniques typically address this problem by explicitly computing in which context the data is obtained (e.g., frontal or lateral) and designing custom data fusion strategies for each context. However, this requires an initial enumeration of all the possible contexts and the design of context “detectors”, which bring their own challenges. Hence, how to effectively utilize both facial and gait information in arbitrary conditions is still an open problem. In this paper we present a context-adaptive multi-biometric fusion strategy that does not require the prior determination of context features; instead, the context is implicitly encoded in the fusion process by a set of attentional weights that encode the relevance of the different modalities for each particular data sample. The key contributions of the paper are threefold. First, we propose a novel framework for the dynamic fusion of multiple biometrics modalities leveraging attention techniques, denoted ‘Adapt-FuseNet’. Second, we perform an extensive evaluation of the proposed method in comparison to various other fusion techniques such as Bilinear Pooling, Parallel Co-attention, Keyless Attention, Multi-modal Factorized High-order Pooling, and Multimodal Tucker Fusion. Third, an Explainable Artificial Intelligence-based interpretation tool is used to analyse how the attention mechanism of ‘Adapt-FuseNet’ is capturing context implicitly and making the best weighting of the different modalities for the task at hand. This enables the interpretability of results in a more human-compliant way, hence boosting our confidence of the operation of AI systems in the wild. Extensive experiments are carried out on two public gait datasets (CASIA-A and CASIA-B), showing that ‘Adapt-FuseNet’ significantly outperforms the state-of-the-art.","PeriodicalId":73307,"journal":{"name":"IEEE transactions on biometrics, behavior, and identity science","volume":"6 4","pages":"515-527"},"PeriodicalIF":0.0000,"publicationDate":"2024-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Exploring Fusion Techniques and Explainable AI on Adapt-FuseNet: Context-Adaptive Fusion of Face and Gait for Person Identification\",\"authors\":\"Thejaswin S;Ashwin Prakash;Athira Nambiar;Alexandre Bernadino\",\"doi\":\"10.1109/TBIOM.2024.3405081\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Biometrics such as human gait and face play a significant role in vision-based surveillance applications. 
However, multimodal fusion of biometric features is a challenging task in non-controlled environments due to varying reliability of the features from different modalities in changing contexts, such as viewpoints, illuminations, occlusion, background clutter, and clothing. For instance, in person identification in the wild, facial and gait features play a complementary role, as, in principle, face provides more discriminatory features than gait if the person is frontal to the camera, while gait features are more discriminative in lateral views. Classical fusion techniques typically address this problem by explicitly computing in which context the data is obtained (e.g., frontal or lateral) and designing custom data fusion strategies for each context. However, this requires an initial enumeration of all the possible contexts and the design of context “detectors”, which bring their own challenges. Hence, how to effectively utilize both facial and gait information in arbitrary conditions is still an open problem. In this paper we present a context-adaptive multi-biometric fusion strategy that does not require the prior determination of context features; instead, the context is implicitly encoded in the fusion process by a set of attentional weights that encode the relevance of the different modalities for each particular data sample. The key contributions of the paper are threefold. First, we propose a novel framework for the dynamic fusion of multiple biometrics modalities leveraging attention techniques, denoted ‘Adapt-FuseNet’. Second, we perform an extensive evaluation of the proposed method in comparison to various other fusion techniques such as Bilinear Pooling, Parallel Co-attention, Keyless Attention, Multi-modal Factorized High-order Pooling, and Multimodal Tucker Fusion. Third, an Explainable Artificial Intelligence-based interpretation tool is used to analyse how the attention mechanism of ‘Adapt-FuseNet’ is capturing context implicitly and making the best weighting of the different modalities for the task at hand. This enables the interpretability of results in a more human-compliant way, hence boosting our confidence of the operation of AI systems in the wild. 
Extensive experiments are carried out on two public gait datasets (CASIA-A and CASIA-B), showing that ‘Adapt-FuseNet’ significantly outperforms the state-of-the-art.\",\"PeriodicalId\":73307,\"journal\":{\"name\":\"IEEE transactions on biometrics, behavior, and identity science\",\"volume\":\"6 4\",\"pages\":\"515-527\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-03-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on biometrics, behavior, and identity science\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10540048/\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on biometrics, behavior, and identity science","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10540048/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Biometrics such as human gait and face play a significant role in vision-based surveillance applications. However, multimodal fusion of biometric features is a challenging task in non-controlled environments due to the varying reliability of features from different modalities under changing contexts such as viewpoint, illumination, occlusion, background clutter, and clothing. For instance, in person identification in the wild, facial and gait features play a complementary role: in principle, the face provides more discriminative features than gait when the person is frontal to the camera, while gait features are more discriminative in lateral views. Classical fusion techniques typically address this problem by explicitly computing in which context the data is obtained (e.g., frontal or lateral) and designing custom data fusion strategies for each context. However, this requires an initial enumeration of all possible contexts and the design of context “detectors”, which bring their own challenges. Hence, how to effectively utilize both facial and gait information under arbitrary conditions is still an open problem. In this paper, we present a context-adaptive multi-biometric fusion strategy that does not require the prior determination of context features; instead, the context is implicitly encoded in the fusion process by a set of attentional weights that capture the relevance of the different modalities for each particular data sample. The key contributions of the paper are threefold. First, we propose a novel framework for the dynamic fusion of multiple biometric modalities leveraging attention techniques, denoted ‘Adapt-FuseNet’. Second, we perform an extensive evaluation of the proposed method against various other fusion techniques such as Bilinear Pooling, Parallel Co-attention, Keyless Attention, Multi-modal Factorized High-order Pooling, and Multimodal Tucker Fusion. Third, an Explainable Artificial Intelligence-based interpretation tool is used to analyse how the attention mechanism of ‘Adapt-FuseNet’ captures context implicitly and assigns the best weighting of the different modalities for the task at hand. This makes the results interpretable in a more human-compliant way, boosting our confidence in the operation of AI systems in the wild. Extensive experiments are carried out on two public gait datasets (CASIA-A and CASIA-B), showing that ‘Adapt-FuseNet’ significantly outperforms the state-of-the-art.
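To make the attention-weighted fusion idea concrete, here is a minimal PyTorch sketch of per-sample attention over face and gait embeddings. This is not the authors' implementation: the module name AdaptiveFusion, the feature dimension of 256, and the single linear scoring head are illustrative assumptions, and it assumes face and gait embeddings have already been extracted by upstream networks.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFusion(nn.Module):
    """Fuses two modality embeddings with per-sample attention weights."""
    def __init__(self, feat_dim: int, num_ids: int):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)        # relevance score per modality
        self.classifier = nn.Linear(feat_dim, num_ids)

    def forward(self, face_feat, gait_feat):
        # Stack the two modality embeddings: (batch, 2, feat_dim).
        feats = torch.stack([face_feat, gait_feat], dim=1)
        # Attention over the two modalities, recomputed for every sample:
        # the context (frontal vs. lateral view, occlusion, etc.) never has
        # to be detected explicitly; it surfaces as a shift in the weights.
        attn = F.softmax(self.score(feats).squeeze(-1), dim=1)   # (batch, 2)
        fused = (attn.unsqueeze(-1) * feats).sum(dim=1)          # (batch, feat_dim)
        # Returning attn alongside the logits allows XAI-style inspection
        # of which modality dominated each identification decision.
        return self.classifier(fused), attn

model = AdaptiveFusion(feat_dim=256, num_ids=124)  # CASIA-B has 124 subjects
face = torch.randn(8, 256)   # stand-ins for pre-extracted face embeddings
gait = torch.randn(8, 256)   # stand-ins for pre-extracted gait embeddings
logits, weights = model(face, gait)
print(weights)  # per-sample modality relevance; each row sums to 1

Under this sketch, one would expect the face weight to dominate for frontal samples and the gait weight for lateral ones, which is exactly the implicit context encoding the abstract describes.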
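For contrast, the Bilinear Pooling baseline named in the abstract can be sketched in one line over the same stand-in tensors from above: it fuses the two embeddings by an outer product, capturing all pairwise feature interactions but yielding a quadratic-size fused vector with no per-sample modality weighting. (Practical variants usually add signed square-root and L2 normalization, omitted here.)

# Hedged sketch of Bilinear Pooling fusion: per-sample outer product
# of face and gait features, flattened to a single vector.
fused_bilinear = torch.einsum('bi,bj->bij', face, gait).flatten(1)  # (8, 256*256)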