Multi-Channel Overlapped Speech Recognition with Location Guided Speech Extraction Network

Zhuo Chen, Xiong Xiao, Takuya Yoshioka, Hakan Erdogan, Jinyu Li, Y. Gong
2018 IEEE Spoken Language Technology Workshop (SLT), published 2018-12-01
DOI: 10.1109/SLT.2018.8639593 (https://doi.org/10.1109/SLT.2018.8639593)
Citations: 109

Abstract

Although advances in close-talk speech recognition have brought relatively low error rates, recognition performance in far-field environments is still limited by low signal-to-noise ratio, reverberation, and overlapped speech from simultaneous speakers, the last of which is especially difficult. Beamforming and speech separation networks were previously proposed to address these problems, but they tend to suffer from leakage of interfering speech or limited generalizability. In this work, we propose a simple yet effective method for multi-channel far-field overlapped speech recognition. In the proposed system, three different features are formed for each target speaker: spectral, spatial, and angle features. A neural network is then trained on all of these features, with the clean speech of the target speaker as the training target. An iterative update procedure is proposed in which mask-based beamforming and mask estimation are performed alternately. The proposed system was evaluated on real recorded meetings with different overlap ratios. The results show that the proposed system achieves a relative word error rate (WER) reduction of more than 24% over fixed beamforming with oracle selection. Moreover, as the overlap ratio rises from 20% to 70+%, only a 3.8% WER increase is observed for the proposed system.
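
The abstract mentions mask-based beamforming but does not say which beamformer is used. As a rough illustration only, the sketch below assumes an MVDR beamformer in the reference-channel formulation, w = (Φ_n⁻¹ Φ_s u) / trace(Φ_n⁻¹ Φ_s), with the spatial covariance matrices Φ_s and Φ_n estimated from a time-frequency mask. The function name `mask_based_mvdr`, its parameters, and the array layout are hypothetical choices for this example and are not taken from the paper.

```python
import numpy as np


def mask_based_mvdr(Y, speech_mask, ref_channel=0, eps=1e-6):
    """Illustrative mask-based MVDR beamformer (assumed formulation).

    Y           : complex STFT of the array signal, shape (C, T, F)
    speech_mask : target-speaker mask in [0, 1], shape (T, F)
    Returns the beamformed STFT, shape (T, F).
    """
    C, T, F = Y.shape
    noise_mask = 1.0 - speech_mask
    out = np.empty((T, F), dtype=complex)
    for f in range(F):
        Yf = Y[:, :, f]  # (C, T) slice for one frequency bin
        ms, mn = speech_mask[:, f], noise_mask[:, f]
        # Mask-weighted spatial covariance matrices, each (C, C)
        phi_s = (ms * Yf) @ Yf.conj().T / (ms.sum() + eps)
        phi_n = (mn * Yf) @ Yf.conj().T / (mn.sum() + eps)
        phi_n = phi_n + eps * np.eye(C)  # diagonal loading for stability
        # Solve Phi_n X = Phi_s instead of forming an explicit inverse
        num = np.linalg.solve(phi_n, phi_s)
        w = num[:, ref_channel] / (np.trace(num) + eps)
        out[:, f] = w.conj() @ Yf  # apply w^H to every frame
    return out
```

In the iterative procedure the abstract describes, a beamformed output like this would presumably be fed back into the mask-estimation network and the two steps repeated; that loop and the network itself are omitted here because the abstract gives no further detail.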