Robustness of ad hoc microphone clustering using speaker embeddings: evaluation under realistic and challenging scenarios

IF 1.9 | Zone 3, Computer Science

Journal on Audio Speech and Music Processing | Pub Date: 2023-10-31 | DOI: 10.1186/s13636-023-00310-w

Stijn Kindt, Jenthe Thienpondt, Luca Becker, Nilesh Madhu

Citations: 0

Abstract

Speaker embeddings, from the ECAPA-TDNN speaker verification network, were recently introduced as features for the task of clustering microphones in ad hoc arrays. Our previous work demonstrated that, in comparison to signal-based Mod-MFCC features, using speaker embeddings yielded a more robust and logical clustering of the microphones around the sources of interest. This work aims to further establish speaker embeddings as a robust feature for ad hoc microphone clustering by addressing open and additional questions of practical interest, arising from our prior work. Specifically, whereas our initial work made use of simulated data based on shoe-box acoustics models, we now present a more thorough analysis in more realistic settings. Furthermore, we investigate additional important considerations such as the choice of the distance metric used in the fuzzy C-means clustering; the minimal time range across which data need to be aggregated to obtain robust clusters; and the performance of the features in increasingly more challenging situations, and with multiple speakers. We also contrast the results on the basis of several metrics for quantifying the quality of such ad hoc clusters. Results indicate that the speaker embeddings are robust to short inference times, and deliver logical and useful clusters, even when the sources are very close to each other.
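
To make the role of the distance metric in fuzzy C-means concrete, the following is a minimal sketch and not the authors' implementation. It assumes each microphone is represented by one fixed-length embedding vector; here the embeddings are synthetic random vectors standing in for ECAPA-TDNN outputs. The function names (`pairwise_distance`, `fuzzy_c_means`), the fuzzifier `m = 2`, and the weighted-mean centroid update used for both metrics are illustrative assumptions.

```python
import numpy as np

def pairwise_distance(X, C, metric):
    """Distance from each row of X (n, d) to each row of C (k, d)."""
    if metric == "euclidean":
        return np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
    if metric == "cosine":
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
        Cn = C / np.linalg.norm(C, axis=1, keepdims=True)
        return 1.0 - Xn @ Cn.T
    raise ValueError(f"unknown metric: {metric}")

def fuzzy_c_means(X, n_clusters, metric="cosine", m=2.0,
                  n_iter=100, tol=1e-6, seed=0):
    """Return soft memberships U (n, k) and centroids C (k, d)."""
    rng = np.random.default_rng(seed)
    # Random initial memberships; each row sums to 1.
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Centroids: membership-weighted mean of the feature vectors.
        W = U ** m
        C = (W.T @ X) / W.sum(axis=0)[:, None]
        # Memberships: inversely related to the distance to each centroid.
        D = np.maximum(pairwise_distance(X, C, metric), 1e-12)
        inv = D ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, C

# Synthetic stand-in: 6 microphones, 3 near each of 2 hypothetical speakers.
rng = np.random.default_rng(1)
proto = rng.normal(size=(2, 16))
X = np.vstack([proto[0] + 0.1 * rng.normal(size=(3, 16)),
               proto[1] + 0.1 * rng.normal(size=(3, 16))])

U, C = fuzzy_c_means(X, n_clusters=2, metric="cosine")
labels = U.argmax(axis=1)  # hard assignment derived from the soft memberships
print(np.round(U, 3), labels)
```

Switching `metric` to `"euclidean"` changes only how distances to the centroids are measured, which is the kind of choice the abstract says is investigated. The soft memberships in `U` also indicate how strongly each microphone is tied to a cluster, rather than giving only a hard label.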

Source journal

Journal on Audio Speech and Music Processing (Engineering: Electrical and Electronic Engineering)

CiteScore

4.10

Self-citation rate

4.20%

Articles published

Journal description: The aim of “EURASIP Journal on Audio, Speech, and Music Processing” is to bring together researchers, scientists and engineers working on the theory and applications of the processing of various audio signals, with a specific focus on speech and music. EURASIP Journal on Audio, Speech, and Music Processing will be an interdisciplinary journal for the dissemination of all basic and applied aspects of speech communication and audio processes.