Tiago Roxo;Joana Cabral Costa;Pedro R. M. Inácio;Hugo Proença
WASD: A Wilder Active Speaker Detection Dataset
IEEE Transactions on Biometrics, Behavior, and Identity Science, vol. 7, no. 1, pp. 61-70. Published 2024-06-11. DOI: 10.1109/TBIOM.2024.3412821
https://ieeexplore.ieee.org/document/10554644/
Abstract
Current Active Speaker Detection (ASD) models achieve good results in cooperative settings with reliable face access using only sound and facial features, but this reliance is ill-suited to less constrained conditions. To demonstrate this limitation of current datasets, we propose a Wilder Active Speaker Detection (WASD) dataset, with increased difficulty by targeting the key components of current ASD: audio and face. Grouped into 5 categories, WASD contains incremental challenges for ASD with tactical impairment of audio and face data, and provides a new source of information for ASD via subject body annotations. To highlight the new challenges of WASD, we divide it into Easy (cooperative settings) and Hard (audio and/or face are specifically degraded) groups, and assess the performance of state-of-the-art models on WASD and on the most challenging available ASD dataset: AVA-ActiveSpeaker. The results show that: 1) AVA-ActiveSpeaker prepares models for cooperative settings but not wilder ones (surveillance); and 2) current ASD approaches cannot perform reliably in wilder settings, even if trained with challenging data. To prove the importance of body for wild ASD, we propose a baseline that complements body with face and audio information and surpasses state-of-the-art models on WASD and Columbia. All contributions are available at https://github.com/Tiago-Roxo/WASD.