An LSTM Auto-Encoder for Single-Channel Speaker Attention System
Authors: Mahnaz Rahmani, F. Razzazi
Published in: 2019 9th International Conference on Computer and Knowledge Engineering (ICCKE), pp. 110-115, 2019-10-01
DOI: 10.1109/ICCKE48569.2019.8965084 (https://doi.org/10.1109/ICCKE48569.2019.8965084)
Citations: 1
Abstract
In this paper, we use a set of long short-term memory (LSTM) deep neural networks to distinguish a particular speaker from the other speakers in single-channel recorded speech. The network structure is modified to suit this task. The proposed architecture models the sequence of spectral data in each frame as the key feature. Each network has two memory cells and accepts an 8-band spectral window as input. The reconstructions of the different bands are merged to rebuild the speaker's utterance. We evaluated how well the proposed system reconstructs the intended speaker using the PESQ and MSE measures. Using all utterances of each speaker in the TIMIT dataset as training data to build an LSTM-based attention auto-encoder model, we achieved a PESQ of 3.66 when rebuilding the intended speaker. In contrast, the PESQ averaged 1.92 for other speakers processed by the same speaker's network. This test was repeated successfully for different utterances of different speakers.
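The band-wise flow described in the abstract (split the spectrum into 8-band windows, reconstruct each window with its own network, merge the results, score with MSE) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: it assumes each window is a contiguous group of 8 frequency bins, that the bin count is a multiple of 8, and it stands in a plain callable for each trained LSTM auto-encoder. The names `split_bands`, `merge_bands`, `reconstruct`, and `mse` are hypothetical.

```python
from typing import Callable, List, Sequence

Frame = List[float]        # one spectral frame: a magnitude per frequency bin
Spectrogram = List[Frame]  # frames ordered in time
BandModel = Callable[[Spectrogram], Spectrogram]  # placeholder for one per-band network

BAND_WIDTH = 8  # from the abstract: each network accepts an 8-band spectral window


def split_bands(spec: Spectrogram, width: int = BAND_WIDTH) -> List[Spectrogram]:
    """Split a spectrogram into contiguous sub-band spectrograms of `width` bins."""
    n_bins = len(spec[0])
    if n_bins % width != 0:
        # Assumption of this sketch; the abstract does not say how edges are handled.
        raise ValueError("number of bins must be a multiple of the band width")
    return [[frame[start:start + width] for frame in spec]
            for start in range(0, n_bins, width)]


def merge_bands(bands: Sequence[Spectrogram]) -> Spectrogram:
    """Concatenate per-band reconstructions back into full-width frames."""
    n_frames = len(bands[0])
    return [[value for band in bands for value in band[t]]
            for t in range(n_frames)]


def reconstruct(spec: Spectrogram, models: Sequence[BandModel]) -> Spectrogram:
    """Run each band through its own model, then merge the band outputs."""
    bands = split_bands(spec)
    if len(models) != len(bands):
        raise ValueError("need exactly one model per band")
    return merge_bands([model(band) for model, band in zip(models, bands)])


def mse(reference: Spectrogram, estimate: Spectrogram) -> float:
    """Mean squared error over all time-frequency cells."""
    total, count = 0.0, 0
    for ref_frame, est_frame in zip(reference, estimate):
        for r, e in zip(ref_frame, est_frame):
            total += (r - e) ** 2
            count += 1
    return total / count
```

In this sketch, a low MSE between the input and the merged output would correspond to the intended speaker being rebuilt well, while a high MSE would correspond to the poorer reconstruction the abstract reports for other speakers. PESQ is not computed here, since it requires a dedicated implementation of the standard.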