基于学习的基于空间连贯的鲁棒说话人计数和分离

IF 1.9 3区计算机科学

Journal on Audio Speech and Music Processing Pub Date : 2023-09-20 DOI:10.1186/s13636-023-00298-3

Yicheng Hsu, Mingsian R. Bai

{"title":"基于学习的基于空间连贯的鲁棒说话人计数和分离","authors":"Yicheng Hsu, Mingsian R. Bai","doi":"10.1186/s13636-023-00298-3","DOIUrl":null,"url":null,"abstract":"Abstract A three-stage approach is proposed for speaker counting and speech separation in noisy and reverberant environments. In the spatial feature extraction, a spatial coherence matrix (SCM) is computed using whitened relative transfer functions (wRTFs) across time frames. The global activity functions of each speaker are estimated from a simplex constructed using the eigenvectors of the SCM, while the local coherence functions are computed from the coherence between the wRTFs of a time-frequency bin and the global activity function-weighted RTF of the target speaker. In speaker counting, we use the eigenvalues of the SCM and the maximum similarity of the interframe global activity distributions between two speakers as the input features to the speaker counting network (SCnet). In speaker separation, a global and local activity-driven network (GLADnet) is used to extract each independent speaker signal, which is particularly useful for highly overlapping speech signals. Experimental results obtained from the real meeting recordings show that the proposed system achieves superior speaker counting and speaker separation performance compared to previous publications without the prior knowledge of the array configurations.","PeriodicalId":49309,"journal":{"name":"Journal on Audio Speech and Music Processing","volume":"13 1","pages":"0"},"PeriodicalIF":1.9000,"publicationDate":"2023-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Learning-based robust speaker counting and separation with the aid of spatial coherence\",\"authors\":\"Yicheng Hsu, Mingsian R. Bai\",\"doi\":\"10.1186/s13636-023-00298-3\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Abstract A three-stage approach is proposed for speaker counting and speech separation in noisy and reverberant environments. In the spatial feature extraction, a spatial coherence matrix (SCM) is computed using whitened relative transfer functions (wRTFs) across time frames. The global activity functions of each speaker are estimated from a simplex constructed using the eigenvectors of the SCM, while the local coherence functions are computed from the coherence between the wRTFs of a time-frequency bin and the global activity function-weighted RTF of the target speaker. In speaker counting, we use the eigenvalues of the SCM and the maximum similarity of the interframe global activity distributions between two speakers as the input features to the speaker counting network (SCnet). In speaker separation, a global and local activity-driven network (GLADnet) is used to extract each independent speaker signal, which is particularly useful for highly overlapping speech signals. Experimental results obtained from the real meeting recordings show that the proposed system achieves superior speaker counting and speaker separation performance compared to previous publications without the prior knowledge of the array configurations.\",\"PeriodicalId\":49309,\"journal\":{\"name\":\"Journal on Audio Speech and Music Processing\",\"volume\":\"13 1\",\"pages\":\"0\"},\"PeriodicalIF\":1.9000,\"publicationDate\":\"2023-09-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal on Audio Speech and Music Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1186/s13636-023-00298-3\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal on Audio Speech and Music Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1186/s13636-023-00298-3","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 1

摘要

摘要:提出了一种用于嘈杂和混响环境下说话人计数和语音分离的三段式方法。在空间特征提取中，利用白化的相对传递函数计算空间相干矩阵(SCM)。利用单片机的特征向量构造一个单纯形来估计每个说话人的全局活动函数，而局部相干函数则由时频bin的wrtf与目标说话人的全局活动函数加权RTF之间的相干性来计算。在说话人计数中，我们使用单片机的特征值和两个说话人之间帧间全局活动分布的最大相似度作为说话人计数网络(SCnet)的输入特征。在说话人分离中，使用全局和局部活动驱动网络(GLADnet)来提取每个独立的说话人信号，这对高度重叠的语音信号特别有用。实际会议录音的实验结果表明，该系统在不需要事先了解阵列配置的情况下，取得了较好的发言者计数和发言者分离性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Learning-based robust speaker counting and separation with the aid of spatial coherence

Abstract A three-stage approach is proposed for speaker counting and speech separation in noisy and reverberant environments. In the spatial feature extraction, a spatial coherence matrix (SCM) is computed using whitened relative transfer functions (wRTFs) across time frames. The global activity functions of each speaker are estimated from a simplex constructed using the eigenvectors of the SCM, while the local coherence functions are computed from the coherence between the wRTFs of a time-frequency bin and the global activity function-weighted RTF of the target speaker. In speaker counting, we use the eigenvalues of the SCM and the maximum similarity of the interframe global activity distributions between two speakers as the input features to the speaker counting network (SCnet). In speaker separation, a global and local activity-driven network (GLADnet) is used to extract each independent speaker signal, which is particularly useful for highly overlapping speech signals. Experimental results obtained from the real meeting recordings show that the proposed system achieves superior speaker counting and speaker separation performance compared to previous publications without the prior knowledge of the array configurations.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Journal on Audio Speech and Music Processing Engineering-Electrical and Electronic Engineering

CiteScore

4.10

自引率

4.20%

发文量

期刊介绍： The aim of “EURASIP Journal on Audio, Speech, and Music Processing” is to bring together researchers, scientists and engineers working on the theory and applications of the processing of various audio signals, with a specific focus on speech and music. EURASIP Journal on Audio, Speech, and Music Processing will be an interdisciplinary journal for the dissemination of all basic and applied aspects of speech communication and audio processes.