{"title":"RefineNet-Based Speech Enhancement with Composite Metric Loss Training","authors":"Chuan Peng, Tian Lan, Yuxin Qian, M. Li, Qiao Liu","doi":"10.1109/ICCT46805.2019.8947267","DOIUrl":null,"url":null,"abstract":"Speech enhancement is a task to improve the quality and intelligibility of degraded speech. Recent work shows convolutional neural network (CNN) with encoder-decoder architecture can achieve better performance with fewer parameters than the feed-forward neural network (FNN) and recurrent neural network (RNN). It inspires us to build a CNN model based on state-of-the-art encoder-decoder architecture, RefineNet, whose effectiveness has been proved for high-resolution semantic segmentation. In this work, RefineNet is used to exploit multi-level time-frequency features for generating high-level features. Furthermore, some works concern that the inconsistency between evaluation metrics and training loss may result in failing to obtain the optimal model by training. Therefore, we take metrics as loss and composite them. Furthermore, the waveform MSE loss, which contains phase information, is also composited to compensate for using noisy phase to reconstruct enhanced speech. 
Experiments show our model achieves high quality and intelligibility and outperform baselines, especially at a low signal-to-noise ratio.","PeriodicalId":306112,"journal":{"name":"2019 IEEE 19th International Conference on Communication Technology (ICCT)","volume":"88 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE 19th International Conference on Communication Technology (ICCT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCT46805.2019.8947267","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Speech enhancement aims to improve the quality and intelligibility of degraded speech. Recent work shows that convolutional neural networks (CNNs) with an encoder-decoder architecture can achieve better performance with fewer parameters than feed-forward neural networks (FNNs) and recurrent neural networks (RNNs). This inspires us to build a CNN model based on a state-of-the-art encoder-decoder architecture, RefineNet, whose effectiveness has been proven for high-resolution semantic segmentation. In this work, RefineNet is used to exploit multi-level time-frequency features for generating high-level features. Furthermore, some works note that inconsistency between the evaluation metrics and the training loss may prevent training from finding the optimal model. Therefore, we use the evaluation metrics themselves as loss terms and combine them into a composite loss. In addition, a waveform MSE loss, which carries phase information, is included to compensate for using the noisy phase when reconstructing the enhanced speech. Experiments show that our model achieves high quality and intelligibility and outperforms the baselines, especially at low signal-to-noise ratios.
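The abstract describes a training objective that combines metric-based loss terms with a phase-aware waveform MSE. A minimal sketch of such a composite loss is shown below; the weighting factor `alpha` and the use of a spectral-magnitude MSE as a stand-in for the metric-based terms are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def composite_loss(est_wave, ref_wave, est_mag, ref_mag, alpha=0.5):
    """Hypothetical composite loss sketch.

    Combines a spectral-magnitude MSE (a stand-in for metric-derived
    loss terms such as STOI/PESQ surrogates) with a time-domain
    waveform MSE, which implicitly carries phase information.
    """
    # Metric-style term computed on time-frequency magnitudes
    metric_term = np.mean((np.asarray(est_mag) - np.asarray(ref_mag)) ** 2)
    # Waveform MSE term; sensitive to phase errors in the reconstruction
    wave_term = np.mean((np.asarray(est_wave) - np.asarray(ref_wave)) ** 2)
    # Weighted combination (alpha is an assumed hyperparameter)
    return alpha * metric_term + (1.0 - alpha) * wave_term
```

With identical estimated and reference signals the loss is zero; otherwise both terms contribute according to `alpha`, letting training trade off metric alignment against phase fidelity.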