Learning speaker representation for neural network based multichannel speaker extraction

2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) Pub Date : 2017-12-01 DOI:10.1109/ASRU.2017.8268910

Kateřina Žmolíková, Marc Delcroix, K. Kinoshita, T. Higuchi, A. Ogawa, T. Nakatani

{"title":"Learning speaker representation for neural network based multichannel speaker extraction","authors":"Kateřina Žmolíková, Marc Delcroix, K. Kinoshita, T. Higuchi, A. Ogawa, T. Nakatani","doi":"10.1109/ASRU.2017.8268910","DOIUrl":null,"url":null,"abstract":"Recently, schemes employing deep neural networks (DNNs) for extracting speech from noisy observation have demonstrated great potential for noise robust automatic speech recognition. However, these schemes are not well suited when the interfering noise is another speaker. To enable extracting a target speaker from a mixture of speakers, we have recently proposed to inform the neural network using speaker information extracted from an adaptation utterance from the same speaker. In our previous work, we explored ways how to inform the network about the speaker and found a speaker adaptive layer approach to be suitable for this task. In our experiments, we used speaker features designed for speaker recognition tasks as the additional speaker information, which may not be optimal for the speaker extraction task. In this paper, we propose a usage of a sequence summarizing scheme enabling to learn the speaker representation jointly with the network. Furthermore, we extend the previous experiments to demonstrate the potential of our proposed method as a front-end for speech recognition and explore the effect of additional noise on the performance of the method.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"49","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ASRU.2017.8268910","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 49

Abstract

Recently, schemes employing deep neural networks (DNNs) for extracting speech from noisy observation have demonstrated great potential for noise robust automatic speech recognition. However, these schemes are not well suited when the interfering noise is another speaker. To enable extracting a target speaker from a mixture of speakers, we have recently proposed to inform the neural network using speaker information extracted from an adaptation utterance from the same speaker. In our previous work, we explored ways how to inform the network about the speaker and found a speaker adaptive layer approach to be suitable for this task. In our experiments, we used speaker features designed for speaker recognition tasks as the additional speaker information, which may not be optimal for the speaker extraction task. In this paper, we propose a usage of a sequence summarizing scheme enabling to learn the speaker representation jointly with the network. Furthermore, we extend the previous experiments to demonstrate the potential of our proposed method as a front-end for speech recognition and explore the effect of additional noise on the performance of the method.

查看原文本刊更多论文

学习基于神经网络的多通道说话人提取的说话人表示

近年来，利用深度神经网络(dnn)从噪声观测中提取语音的方案在噪声鲁棒性自动语音识别方面显示出巨大的潜力。然而，当干扰噪声是另一个扬声器时，这些方案不太适合。为了能够从混合说话人中提取目标说话人，我们最近提出使用从同一说话人的自适应话语中提取的说话人信息来通知神经网络。在我们之前的工作中，我们探索了如何告知网络关于说话人的方法，并找到了一个适合于这项任务的说话人自适应层方法。在我们的实验中，我们使用了为说话人识别任务设计的说话人特征作为额外的说话人信息，这对于说话人提取任务来说可能不是最优的。在本文中，我们提出了一种能够与网络共同学习说话人表征的序列总结方案。此外，我们扩展了之前的实验，以证明我们提出的方法作为语音识别前端的潜力，并探索附加噪声对方法性能的影响。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)

自引率

0.00%

发文量