Mahnaz Rahmani, F. Razzazi
2019 9th International Conference on Computer and Knowledge Engineering (ICCKE), October 2019, pp. 110-115
DOI: 10.1109/ICCKE48569.2019.8965084
An LSTM Auto-Encoder for Single-Channel Speaker Attention System
In this paper, we utilize a set of long short-term memory (LSTM) deep neural networks to distinguish a particular speaker from the remaining speakers in single-channel recorded speech. The network structure is modified to suit this task. The proposed architecture models the sequence of spectral data in each frame as the key feature. Each network has two memory cells and accepts an 8-band spectral window as input. The reconstructions of the different bands are merged to rebuild the speaker's utterance. We evaluated the reconstruction performance of the proposed system for the intended speaker with PESQ and MSE measures. Using all utterances of each speaker in the TIMIT dataset as training data to build an LSTM-based attention auto-encoder model, we achieved a PESQ score of 3.66 when rebuilding the intended speaker. In contrast, PESQ averaged 1.92 for other speakers processed through the same speaker's network. This test was successfully repeated for different utterances of different speakers.
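The band-wise auto-encoder described above can be sketched as follows. This is a minimal illustrative model, not the authors' implementation: the hidden size, layer arrangement, and the `BandAutoEncoder` name are assumptions; only the 8-band input window and the MSE reconstruction objective come from the abstract.

```python
import torch
import torch.nn as nn


class BandAutoEncoder(nn.Module):
    """Illustrative LSTM auto-encoder for one 8-band spectral sub-window.

    Hyperparameters (hidden size, single-layer encoder/decoder) are
    assumptions for the sketch, not taken from the paper.
    """

    def __init__(self, n_bands: int = 8, hidden: int = 32):
        super().__init__()
        # Encoder and decoder LSTMs over the frame sequence; the paper
        # describes two memory cells per per-band network.
        self.encoder = nn.LSTM(n_bands, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_bands)  # project back to 8 bands

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, n_bands) spectral magnitudes
        z, _ = self.encoder(x)
        y, _ = self.decoder(z)
        return self.out(y)


model = BandAutoEncoder()
frames = torch.randn(1, 50, 8)  # 50 frames of one 8-band spectral window
recon = model(frames)
# MSE between input and reconstruction is the training objective; at
# inference, a low MSE for the intended speaker and a high MSE for other
# speakers is what separates them.
loss = nn.functional.mse_loss(recon, frames)
print(recon.shape)
```

In the full system, one such network would be trained per 8-band window, and the per-band reconstructions merged to rebuild the target speaker's utterance before scoring with PESQ.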