An LSTM Auto-Encoder for Single-Channel Speaker Attention System

Mahnaz Rahmani, F. Razzazi
2019 9th International Conference on Computer and Knowledge Engineering (ICCKE), pp. 110-115
Published: 2019-10-01
DOI: 10.1109/ICCKE48569.2019.8965084 (https://doi.org/10.1109/ICCKE48569.2019.8965084)
Citations: 1

Abstract
In this paper, we use a set of long short-term memory (LSTM) deep neural networks to distinguish a particular speaker from the other speakers in single-channel recorded speech. The network structure is modified to suit this task. The proposed architecture models the sequence of spectral data in each frame as the key feature. Each network has two memory cells and accepts an 8-band spectral window as input. The reconstructions of the different bands are merged to rebuild the speaker's utterance. We evaluated how well the proposed system reconstructs the intended speaker using the PESQ and MSE measures. Using all utterances of each speaker in the TIMIT dataset as training data to build an LSTM-based attention auto-encoder model, we achieved a PESQ of 3.66 when rebuilding the intended speaker. In contrast, the PESQ averaged 1.92 for other speakers passed through the same speaker's network. This test was repeated successfully for different utterances of different speakers.
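The split-reconstruct-merge flow described in the abstract can be illustrated with a minimal sketch. Only the 8-band window width and the use of MSE come from the abstract; everything else is an assumption made for illustration: the helper names (`split_bands`, `merge_bands`, `reconstruct`, `mse`) are hypothetical, the windows are assumed to be non-overlapping, merging is assumed to be plain concatenation, and each per-band LSTM is stood in for by an arbitrary callable. PESQ is not computed here.

```python
from typing import Callable, List

Frame = List[float]        # one spectral frame: one magnitude per frequency bin
Spectrogram = List[Frame]  # sequence of frames over time
BandModel = Callable[[Spectrogram], Spectrogram]

BAND_WIDTH = 8  # from the abstract: each network accepts an 8-band window


def split_bands(spec: Spectrogram, width: int = BAND_WIDTH) -> List[Spectrogram]:
    """Split every frame into consecutive windows of `width` bins.

    Non-overlapping windows are an assumption; the abstract does not
    say how the windows are laid out.
    """
    n_bins = len(spec[0])
    if n_bins % width != 0:
        raise ValueError("number of bins must be a multiple of the window width")
    return [[frame[start:start + width] for frame in spec]
            for start in range(0, n_bins, width)]


def merge_bands(bands: List[Spectrogram]) -> Spectrogram:
    """Concatenate per-band reconstructions back into full frames (assumed merge rule)."""
    n_frames = len(bands[0])
    return [[value for band in bands for value in band[t]]
            for t in range(n_frames)]


def reconstruct(spec: Spectrogram, band_models: List[BandModel]) -> Spectrogram:
    """Run one model per band window, then merge the outputs."""
    bands = split_bands(spec)
    if len(bands) != len(band_models):
        raise ValueError("need exactly one model per band window")
    return merge_bands([model(band) for model, band in zip(band_models, bands)])


def mse(a: Spectrogram, b: Spectrogram) -> float:
    """Mean squared error over all time-frequency bins."""
    diffs = [(x - y) ** 2 for fa, fb in zip(a, b) for x, y in zip(fa, fb)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    # Toy input: 3 frames of 16 bins, i.e. two 8-bin windows per frame.
    spec = [[float(t * 16 + k) for k in range(16)] for t in range(3)]
    identity = lambda band: band  # placeholder for a trained per-band network
    print(mse(spec, reconstruct(spec, [identity, identity])))
```

With trained per-band networks in place of the placeholders, a low MSE between input and output would be expected for the speaker the networks were trained on and a higher one for other speakers, which is the contrast the reported PESQ figures describe.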