Cracking the cocktail party problem by multi-beam deep attractor network

Zhuo Chen, Jinyu Li, Xiong Xiao, Takuya Yoshioka, Huaming Wang, Zhenghao Wang, Y. Gong
{"title":"Cracking the cocktail party problem by multi-beam deep attractor network","authors":"Zhuo Chen, Jinyu Li, Xiong Xiao, Takuya Yoshioka, Huaming Wang, Zhenghao Wang, Y. Gong","doi":"10.1109/ASRU.2017.8268969","DOIUrl":null,"url":null,"abstract":"While recent progresses in neural network approaches to singlechannel speech separation, or more generally the cocktail party problem, achieved significant improvement, their performance for complex mixtures is still not satisfactory. In this work, we propose a novel multi-channel framework for multi-talker separation. In the proposed model, an input multi-channel mixture signal is firstly converted to a set of beamformed signals using fixed beam patterns. For this beamforming, we propose to use differential beamformers as they are more suitable for speech separation. Then each beamformed signal is fed into a single-channel anchored deep attractor network to generate separated signals. And the final separation is acquired by post selecting the separating output for each beams. To evaluate the proposed system, we create a challenging dataset comprising mixtures of 2, 3 or 4 speakers. Our results show that the proposed system largely improves the state of the art in speech separation, achieving 11.5 dB, 11.76 dB and 11.02 dB average signal-to-distortion ratio improvement for 4, 3 and 2 overlapped speaker mixtures, which is comparable to the performance of a minimum variance distortionless response beamformer that uses oracle location, source, and noise information. We also run speech recognition with a clean trained acoustic model on the separated speech, achieving relative word error rate (WER) reduction of 45.76%, 59.40% and 62.80% on fully overlapped speech of 4, 3 and 2 speakers, respectively. With a far talk acoustic model, the WER is further reduced.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"34","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ASRU.2017.8268969","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 34

Abstract

While recent progress in neural network approaches to single-channel speech separation, or more generally the cocktail party problem, has brought significant improvement, performance on complex mixtures is still not satisfactory. In this work, we propose a novel multi-channel framework for multi-talker separation. In the proposed model, an input multi-channel mixture signal is first converted to a set of beamformed signals using fixed beam patterns. For this beamforming, we propose to use differential beamformers, as they are better suited to speech separation. Each beamformed signal is then fed into a single-channel anchored deep attractor network to generate separated signals, and the final separation is obtained by post-selecting the separation output for each beam. To evaluate the proposed system, we create a challenging dataset comprising mixtures of 2, 3, or 4 speakers. Our results show that the proposed system substantially improves the state of the art in speech separation, achieving average signal-to-distortion ratio improvements of 11.5 dB, 11.76 dB, and 11.02 dB for 4-, 3-, and 2-speaker mixtures, respectively, which is comparable to the performance of a minimum variance distortionless response beamformer that uses oracle location, source, and noise information. We also run speech recognition with a clean-trained acoustic model on the separated speech, achieving relative word error rate (WER) reductions of 45.76%, 59.40%, and 62.80% on fully overlapped speech of 4, 3, and 2 speakers, respectively. With a far-talk acoustic model, the WER is further reduced.
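The fixed beamforming front end can be illustrated with a minimal sketch. Below is a hypothetical first-order (delay-and-subtract) differential beamformer for a two-microphone pair; the paper's actual array geometry, number of beams, and beam patterns are not specified in the abstract, so every parameter here is an assumption.

```python
# Minimal sketch of a fixed first-order differential beamformer.
# All geometry (2-mic pair, 4 cm spacing) is assumed for illustration;
# the paper's actual fixed beam patterns are not given in the abstract.
import numpy as np

def differential_beam(x_stft, null_angle_deg, mic_dist=0.04,
                      fs=16000, n_fft=512, c=343.0):
    """Delay-and-subtract beamformer that places a spatial null toward
    `null_angle_deg` for a two-microphone pair spaced `mic_dist` apart.

    x_stft : complex STFT of both channels, shape (2, frames, n_fft//2 + 1),
             with mic 0 as the reference. A plane wave from angle theta
             (measured from the mic-0 -> mic-1 axis) is assumed to reach
             mic 1 delayed by mic_dist * cos(theta) / c relative to mic 0.
    Returns the beamformed STFT, shape (frames, n_fft//2 + 1).
    """
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)                 # (F,)
    tau = mic_dist * np.cos(np.deg2rad(null_angle_deg)) / c
    # Y = X1 - e^{-j 2 pi f tau} X0: a source arriving from the null
    # direction yields identical terms and cancels.
    steer = np.exp(-2j * np.pi * freqs * tau)                  # (F,)
    return x_stft[1] - steer * x_stft[0]

# A bank of fixed beams, e.g. nulls swept around the horizontal plane:
# beams = [differential_beam(x_stft, a) for a in (0, 60, 120, 180, 240, 300)]
```

A practical differential beamformer would also equalize the high-pass response introduced by the subtraction; that compensation is omitted here for brevity.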
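The per-beam separation stage builds on the deep attractor network (DANet) idea: a network embeds each time-frequency bin, attractor points (one per speaker) are formed in the embedding space, and masks follow from the similarity of each bin to each attractor. The sketch below shows only this mask computation in plain NumPy; the embedding network itself (typically a BLSTM) and the anchored variant's learned anchor points are omitted, and both helper functions are illustrative rather than the paper's implementation.

```python
# Illustrative sketch of the deep attractor mask computation.
# `embeddings` would come from a learned network (e.g., a BLSTM over the
# log-magnitude STFT); here we only show how masks follow from attractors.
import numpy as np

def attractors_from_assignments(embeddings, assignments, eps=1e-8):
    """Training-time attractors: the mean embedding of each source's
    dominant T-F bins.

    embeddings  : (N, K), one K-dim embedding per T-F bin (N = T * F).
    assignments : (C, N), one-hot dominance of each of C sources per bin.
    """
    return (assignments @ embeddings) / (assignments.sum(1, keepdims=True) + eps)

def danet_masks(embeddings, attractors):
    """Soft masks from embedding-to-attractor similarity.

    Returns (C, N) masks that sum to one across sources for every bin,
    via a softmax over dot-product similarities, as in DANet.
    """
    logits = attractors @ embeddings.T               # (C, N)
    logits -= logits.max(axis=0, keepdims=True)      # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=0, keepdims=True)

# Separated spectrograms are then mask * mixture magnitude, resynthesized
# with the mixture phase. In the anchored variant, inference-time attractors
# are derived from learned anchor points instead of oracle assignments.
```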
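Finally, each beam's network produces candidate separated signals, and the system post-selects among them. The abstract does not describe the selection rule, so the sketch below is only one plausible heuristic: keep the highest-energy candidates while discarding near-duplicates of signals already selected.

```python
# Hypothetical post-selection across beams: the paper's actual rule is
# not given in the abstract; this greedy energy-plus-dedup heuristic is
# an assumption made for illustration only.
import numpy as np

def post_select(candidates, num_speakers, sim_thresh=0.7, eps=1e-8):
    """Pick `num_speakers` outputs from the pooled per-beam candidates.

    candidates : list of equal-length 1-D time signals (beams x outputs).
    Candidates are visited in decreasing energy order; one is kept only if
    its zero-lag normalized correlation with every already-kept signal is
    below `sim_thresh` (i.e., it is not a duplicate of the same speaker).
    """
    order = np.argsort([-np.sum(c ** 2) for c in candidates])
    kept = []
    for idx in order:
        cand = candidates[idx]
        unit = cand / (np.linalg.norm(cand) + eps)
        if all(abs(np.dot(unit, k / (np.linalg.norm(k) + eps))) < sim_thresh
               for k in kept):
            kept.append(cand)
        if len(kept) == num_speakers:
            break
    return kept
```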