Changyan Zheng, Jibin Yang, Xiongwei Zhang, Meng Sun, Kun Yao
2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), published 2019-11-01
DOI: 10.1109/APSIPAASC47483.2019.9023226 (https://doi.org/10.1109/APSIPAASC47483.2019.9023226)
Improving the Spectra Recovering of Bone-Conducted Speech via Structural SIMilarity Loss Function
Bone-conducted (BC) speech is immune to background noise but suffers from low speech quality because of the severe loss of high-frequency components. The key to BC speech enhancement is to restore the missing parts of the spectra. However, even with advanced deep neural networks (DNNs), some of the recovered components still lack the expected spectro-temporal structures. The Mean Square Error (MSE) loss function is the typical choice for supervised DNN training, but it only measures the point-wise distance between spectro-temporal points and cannot evaluate the similarity of structures. In this paper, the Structural SIMilarity (SSIM) loss function, which originated in image quality assessment, is proposed for training the spectral mapping model in BC speech enhancement; to the best of our knowledge, this is the first time that SSIM has been deployed in DNN-based speech signal processing tasks. Experimental results show that, compared with MSE, SSIM achieves better objective results and yields spectra whose spectro-temporal structures are more similar to those of the target. Some hyper-parameters of SSIM are adjusted to account for the differences between natural images and magnitude spectrograms, and optimal choices for them are suggested. In addition, the effects of the three components of SSIM are analyzed individually, with the aim of supporting further study of this loss function in other speech signal processing tasks.
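To make the contrast between MSE and SSIM concrete, the following is a minimal sketch of an SSIM-style loss on a magnitude spectrogram, written in plain Python. It is not the authors' implementation. The three terms (luminance, contrast, structure) and the constants K1 = 0.01 and K2 = 0.03 follow the usual SSIM formulation from image quality assessment; the uniform 4x4 window, the stride of 2, the dynamic range of 1.0, and all function names are assumptions chosen for illustration, since the abstract does not state the hyper-parameters the paper settles on.

```python
from math import sqrt


def ssim_components(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Return (luminance, contrast, structure) for two equal-length flat patches."""
    n = len(x)
    # Stabilising constants from the usual SSIM definition; data_range is an assumption.
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2
    c3 = c2 / 2.0
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx, sy = sqrt(vx), sqrt(vy)
    luminance = (2 * mx * my + c1) / (mx ** 2 + my ** 2 + c1)
    contrast = (2 * sx * sy + c2) / (vx + vy + c2)
    structure = (cov + c3) / (sx * sy + c3)
    return luminance, contrast, structure


def ssim_loss(pred, target, win=4, stride=2, **kwargs):
    """1 - mean SSIM over sliding win x win patches of 2-D lists [freq][time].

    Uses a uniform (unweighted) window, which is an illustrative simplification.
    """
    rows, cols = len(target), len(target[0])
    scores = []
    for i in range(0, rows - win + 1, stride):
        for j in range(0, cols - win + 1, stride):
            patch_p = [pred[r][t] for r in range(i, i + win) for t in range(j, j + win)]
            patch_t = [target[r][t] for r in range(i, i + win) for t in range(j, j + win)]
            lum, con, struct = ssim_components(patch_p, patch_t, **kwargs)
            scores.append(lum * con * struct)
    return 1.0 - sum(scores) / len(scores)


def mse_loss(pred, target):
    """Point-wise mean squared error between two 2-D lists."""
    errors = [(p - t) ** 2 for rp, rt in zip(pred, target) for p, t in zip(rp, rt)]
    return sum(errors) / len(errors)


# Toy "spectrogram" with harmonic-like stripes along the frequency axis.
target = [[0.3 if r % 2 == 0 else 0.7 for _ in range(8)] for r in range(8)]
# Prediction A: structure lost entirely (flat spectrum at the mean level).
flat = [[0.5 for _ in range(8)] for _ in range(8)]
# Prediction B: structure preserved, but with a constant level offset.
shifted = [[v + 0.2 for v in row] for row in target]

print("MSE  flat vs shifted:", mse_loss(flat, target), mse_loss(shifted, target))
print("SSIM loss flat vs shifted:", ssim_loss(flat, target), ssim_loss(shifted, target))
```

In this toy case both predictions are equally far from the target in the MSE sense, yet the SSIM-style loss penalises the flat prediction far more, because its contrast term collapses when the striped structure is missing. Returning the three terms separately also makes it possible to inspect each component on its own, in the spirit of the per-component analysis the abstract mentions.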