Improving Speech Enhancement with Phonetic Embedding Features
Bo Wu, Meng Yu, Lianwu Chen, Mingjie Jin, Dan Su, Dong Yu
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019
DOI: 10.1109/ASRU46091.2019.9003987
Citations: 2
Abstract
In this paper, we present a speech enhancement framework that leverages phonetic information obtained from an acoustic model. It consists of two separate components: (i) a long short-term memory recurrent neural network (LSTM-RNN) based speech enhancement model that takes the combination of log-power spectra (LPS) and phonetic embedding features as input to predict the complex ideal ratio mask (cIRM); and (ii) a convolutional, long short-term memory and fully connected deep neural network (CLDNN) based acoustic model that extracts the phonetic feature vector from the hidden units of its LSTM layer. Our experimental results show that the proposed framework outperforms both conventional and phoneme-dependent speech enhancement systems under various noisy conditions, generalizes well to unseen conditions, and is robust to speech interference. We further demonstrate its superior enhancement performance on unvoiced speech and report a preliminary yet promising recognition experiment on real test data.
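To make the enhancement front-end concrete, the sketch below shows how such a model might be wired up in PyTorch. It is a minimal illustration under stated assumptions, not the authors' implementation: the feature dimensions (257 LPS bins from a 512-point FFT, a 256-dimensional phonetic embedding), the two-layer LSTM, and the names PhoneticLSTMEnhancer and apply_cirm are all hypothetical; the abstract only specifies that LPS and phonetic embedding features are concatenated as input and that the network predicts the cIRM, whose real and imaginary parts are modeled here as two linear output heads.

import torch
import torch.nn as nn

class PhoneticLSTMEnhancer(nn.Module):
    """LSTM-RNN enhancement model: LPS + phonetic embedding -> cIRM.

    All dimensions below are assumptions for illustration; the paper
    does not fix them in the abstract.
    """

    def __init__(self, lps_dim=257, phone_dim=256, hidden=512, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=lps_dim + phone_dim,  # concatenated input features
            hidden_size=hidden,
            num_layers=layers,
            batch_first=True,
        )
        # Two heads: real and imaginary parts of the complex ideal ratio mask.
        self.mask_real = nn.Linear(hidden, lps_dim)
        self.mask_imag = nn.Linear(hidden, lps_dim)

    def forward(self, lps, phone_emb):
        # lps:       (batch, frames, lps_dim)   log-power spectra
        # phone_emb: (batch, frames, phone_dim) frame-aligned phonetic features,
        #            e.g. taken from an acoustic model's LSTM hidden units
        x = torch.cat([lps, phone_emb], dim=-1)
        h, _ = self.lstm(x)
        return self.mask_real(h), self.mask_imag(h)

def apply_cirm(mr, mi, yr, yi):
    # Apply the predicted mask to the noisy complex spectrogram Y by
    # complex multiplication: S_hat = (Mr + i*Mi) * (Yr + i*Yi).
    sr = mr * yr - mi * yi
    si = mr * yi + mi * yr
    return sr, si

At inference time, the masked complex spectrogram from apply_cirm would be converted back to a waveform by inverse STFT; that resynthesis step, like the dimensions above, is standard practice rather than something the abstract spells out.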