Time-Frequency Loss for CNN Based Speech Super-Resolution
Heming Wang, Deliang Wang
ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 861-865, May 2020
DOI: 10.1109/ICASSP40776.2020.9053712
Citations: 17
Abstract
Speech super-resolution (SR), also called speech bandwidth extension (BWE), aims to increase the sampling rate of a given lower-resolution speech signal. Recent years have witnessed the successful application of deep neural networks in the time or frequency domain, and deep learning has improved performance considerably compared with conventional approaches. This paper proposes an autoencoder-based fully convolutional neural network (CNN) that merges information from both the time and frequency domains. At training time, we optimize the CNN using a new time-frequency loss (T-F loss), which combines a time-domain loss and a frequency-domain loss. The experimental results show that our model trained with the T-F loss achieves significantly better results than other state-of-the-art models, and yields balanced performance in terms of time and frequency metrics.
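The core idea of the abstract — summing a time-domain waveform loss with a frequency-domain spectral loss — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the choice of mean squared error for both terms, the log-magnitude spectrum, and the weighting factor `alpha` are all assumptions made for the sketch.

```python
import numpy as np

def tf_loss(pred, target, alpha=0.5, eps=1e-8):
    """Hypothetical time-frequency (T-F) loss sketch.

    Combines a time-domain term and a frequency-domain term;
    the exact terms and weighting in the paper may differ.
    """
    # Time-domain term: mean squared error between raw waveforms.
    t_loss = np.mean((pred - target) ** 2)

    # Frequency-domain term: MSE between log-magnitude spectra,
    # computed here with a plain real FFT (eps avoids log(0)).
    pred_mag = np.abs(np.fft.rfft(pred))
    target_mag = np.abs(np.fft.rfft(target))
    f_loss = np.mean((np.log(pred_mag + eps) - np.log(target_mag + eps)) ** 2)

    # Weighted combination of the two domains.
    return alpha * t_loss + (1 - alpha) * f_loss
```

Balancing the two terms is the point of such a loss: a purely time-domain objective tends to score well on waveform metrics (e.g., SNR) while underfitting high-frequency spectral detail, and a purely spectral objective has the opposite bias.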