Tiago Roxo;Joana Cabral Costa;Pedro R. M. Inácio;Hugo Proença
{"title":"BIAS: A Body-Based Interpretable Active Speaker Approach","authors":"Tiago Roxo;Joana Cabral Costa;Pedro R. M. Inácio;Hugo Proença","doi":"10.1109/TBIOM.2024.3520030","DOIUrl":null,"url":null,"abstract":"State-of-the-art Active Speaker Detection (ASD) approaches heavily rely on audio and facial features to perform, which is not a sustainable approach in wild scenarios. Although these methods achieve good results in the standard AVA-ActiveSpeaker set, a recent wilder ASD dataset (WASD) showed the limitations of such models and raised the need for new approaches. As such, we propose BIAS, a model that, for the first time, combines audio, face, and body information, to accurately predict active speakers in varying/challenging conditions. Additionally, we design BIAS to provide interpretability by proposing a novel use for Squeeze-and-Excitation blocks, namely in attention heatmaps creation and feature importance assessment. For a full interpretability setup, we annotate an ASD-related actions dataset (ASD-Text) to finetune a ViT-GPT2 for text scene description to complement BIAS interpretability. The results show that BIAS is state-of-the-art in challenging conditions where body-based features are of utmost importance (Columbia, open-settings, and WASD), and yields competitive results in AVA-ActiveSpeaker, where face is more influential than body for ASD. BIAS interpretability also shows the features/aspects more relevant towards ASD prediction in varying settings, making it a strong baseline for further developments in interpretable ASD models, and is available at <uri>https://github.com/Tiago-Roxo/BIAS</uri>.","PeriodicalId":73307,"journal":{"name":"IEEE transactions on biometrics, behavior, and identity science","volume":"7 3","pages":"410-421"},"PeriodicalIF":5.0000,"publicationDate":"2024-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on biometrics, behavior, and identity science","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10806889/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
State-of-the-art Active Speaker Detection (ASD) approaches rely heavily on audio and facial features, which is not a sustainable approach in wild scenarios. Although these methods achieve good results on the standard AVA-ActiveSpeaker set, a recent wilder ASD dataset (WASD) exposed the limitations of such models and raised the need for new approaches. As such, we propose BIAS, a model that, for the first time, combines audio, face, and body information to accurately predict active speakers in varying/challenging conditions. Additionally, we design BIAS to provide interpretability by proposing a novel use for Squeeze-and-Excitation blocks, namely in attention heatmap creation and feature importance assessment. For a full interpretability setup, we annotate an ASD-related actions dataset (ASD-Text) to fine-tune a ViT-GPT2 for textual scene description, complementing BIAS interpretability. The results show that BIAS is state-of-the-art in challenging conditions where body-based features are of utmost importance (Columbia, open-settings, and WASD) and yields competitive results on AVA-ActiveSpeaker, where the face is more influential than the body for ASD. BIAS interpretability also shows the features/aspects most relevant to ASD prediction in varying settings, making it a strong baseline for further developments in interpretable ASD models. The code is available at https://github.com/Tiago-Roxo/BIAS.
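The abstract does not detail how the Squeeze-and-Excitation (SE) weights are turned into heatmaps or importance scores; the sketch below is only a minimal illustration of a standard SE block whose excitation weights could be read out as per-channel importance, which is the kind of signal BIAS reportedly reuses for interpretability. The class name, reduction ratio, and toy tensor shapes are illustrative assumptions, not the authors' implementation (see the repository linked above for the actual code).

```python
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Standard Squeeze-and-Excitation block (Hu et al., 2018).

    NOTE: illustrative sketch only; layer placement and the exact
    interpretability readout in BIAS may differ from this.
    """

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)          # global spatial average
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                               # per-channel weights in [0, 1]
        )

    def forward(self, x: torch.Tensor):
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c))     # (B, C) channel weights
        return x * w.view(b, c, 1, 1), w                # reweighted features + raw weights


# Toy usage: treat the excitation weights as a rough feature-importance signal.
feats = torch.randn(2, 64, 14, 14)                      # e.g. body-stream feature maps
block = SEBlock(64)
reweighted, weights = block(feats)
print(weights.mean(dim=0).topk(5).indices)              # most strongly weighted channels
```

Because the sigmoid output scales each channel before fusion, averaging those weights over a batch gives a simple ranking of which channels (and, by extension, which modality stream) the model leaned on for a given prediction.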