BIAS: A Body-Based Interpretable Active Speaker Approach

Impact Factor: 5.0
Tiago Roxo;Joana Cabral Costa;Pedro R. M. Inácio;Hugo Proença
{"title":"BIAS: A Body-Based Interpretable Active Speaker Approach","authors":"Tiago Roxo;Joana Cabral Costa;Pedro R. M. Inácio;Hugo Proença","doi":"10.1109/TBIOM.2024.3520030","DOIUrl":null,"url":null,"abstract":"State-of-the-art Active Speaker Detection (ASD) approaches heavily rely on audio and facial features to perform, which is not a sustainable approach in wild scenarios. Although these methods achieve good results in the standard AVA-ActiveSpeaker set, a recent wilder ASD dataset (WASD) showed the limitations of such models and raised the need for new approaches. As such, we propose BIAS, a model that, for the first time, combines audio, face, and body information, to accurately predict active speakers in varying/challenging conditions. Additionally, we design BIAS to provide interpretability by proposing a novel use for Squeeze-and-Excitation blocks, namely in attention heatmaps creation and feature importance assessment. For a full interpretability setup, we annotate an ASD-related actions dataset (ASD-Text) to finetune a ViT-GPT2 for text scene description to complement BIAS interpretability. The results show that BIAS is state-of-the-art in challenging conditions where body-based features are of utmost importance (Columbia, open-settings, and WASD), and yields competitive results in AVA-ActiveSpeaker, where face is more influential than body for ASD. 
BIAS interpretability also shows the features/aspects more relevant towards ASD prediction in varying settings, making it a strong baseline for further developments in interpretable ASD models, and is available at <uri>https://github.com/Tiago-Roxo/BIAS</uri>.","PeriodicalId":73307,"journal":{"name":"IEEE transactions on biometrics, behavior, and identity science","volume":"7 3","pages":"410-421"},"PeriodicalIF":5.0000,"publicationDate":"2024-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on biometrics, behavior, and identity science","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10806889/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

State-of-the-art Active Speaker Detection (ASD) approaches heavily rely on audio and facial features to perform, which is not a sustainable approach in wild scenarios. Although these methods achieve good results in the standard AVA-ActiveSpeaker set, a recent wilder ASD dataset (WASD) showed the limitations of such models and raised the need for new approaches. As such, we propose BIAS, a model that, for the first time, combines audio, face, and body information, to accurately predict active speakers in varying/challenging conditions. Additionally, we design BIAS to provide interpretability by proposing a novel use for Squeeze-and-Excitation blocks, namely in attention heatmaps creation and feature importance assessment. For a full interpretability setup, we annotate an ASD-related actions dataset (ASD-Text) to finetune a ViT-GPT2 for text scene description to complement BIAS interpretability. The results show that BIAS is state-of-the-art in challenging conditions where body-based features are of utmost importance (Columbia, open-settings, and WASD), and yields competitive results in AVA-ActiveSpeaker, where face is more influential than body for ASD. BIAS interpretability also shows the features/aspects more relevant towards ASD prediction in varying settings, making it a strong baseline for further developments in interpretable ASD models, and is available at https://github.com/Tiago-Roxo/BIAS.
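The abstract's interpretability claim rests on repurposing Squeeze-and-Excitation (SE) blocks: the per-channel gate values that an SE block computes to recalibrate features can be read off directly as channel-importance scores. The following is a minimal NumPy sketch of that idea, not the actual BIAS implementation; the weight matrices `w1` and `w2` and the `(C, H, W)` feature layout are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(features, w1, w2):
    """Squeeze-and-Excitation over a (C, H, W) feature map.

    Returns the recalibrated features together with the per-channel
    excitation weights `s` — the quantity an SE-based interpretability
    scheme can expose as feature-importance scores.
    """
    c = features.shape[0]
    # Squeeze: global average pooling -> one scalar per channel.
    z = features.reshape(c, -1).mean(axis=1)
    # Excitation: bottleneck MLP (ReLU then sigmoid gating).
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))
    # Scale: reweight each channel map by its gate value.
    return features * s[:, None, None], s
```

Because each gate value lies in (0, 1), ranking channels by `s` gives a direct importance ordering, and projecting the gated channels back onto the input grid yields the kind of attention heatmap the abstract describes.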
Source journal: CiteScore 10.90; self-citation rate 0.00%.