{"title":"Time-Frequency Mask-based Speech Enhancement using Convolutional Generative Adversarial Network","authors":"Neil Shah, H. Patil, Meet H. Soni","doi":"10.23919/APSIPA.2018.8659692","DOIUrl":null,"url":null,"abstract":"Speech Enhancement (SE) system deals with improving the perceptual quality and preserving the speech intelligibility of the noisy mixture. The Time-Frequency (T-F) masking-based SE using the supervised learning algorithm, such as a Deep Neural Network (DNN), has outperformed the traditional SE techniques. However, the notable difference observed between the oracle mask and the predicted mask, motivates us to explore different deep learning architectures. In this paper, we propose to use a Convolutional Neural Network (CNN)-based Generative Adversarial Network (GAN) for inherent mask estimation. GAN takes an advantage of the adversarial optimization, an alternative to the other Maximum Likelihood (ML) optimization-based architectures. We also show the need for supervised T-F mask estimation for effective noise suppression. Experimental results demonstrate that the proposed T-F mask-based SE significantly outperforms the recently proposed end-to-end SEGAN and a GAN-based Pix2Pix architecture. The performance evaluation in terms of both the predicted mask and the objective measures, dictates the improvement in the speech quality, while simultaneously reducing the speech distortion observed in the noisy mixture.","PeriodicalId":287799,"journal":{"name":"2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"32","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.23919/APSIPA.2018.8659692","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 32
Abstract
Speech Enhancement (SE) system deals with improving the perceptual quality and preserving the speech intelligibility of the noisy mixture. The Time-Frequency (T-F) masking-based SE using the supervised learning algorithm, such as a Deep Neural Network (DNN), has outperformed the traditional SE techniques. However, the notable difference observed between the oracle mask and the predicted mask, motivates us to explore different deep learning architectures. In this paper, we propose to use a Convolutional Neural Network (CNN)-based Generative Adversarial Network (GAN) for inherent mask estimation. GAN takes an advantage of the adversarial optimization, an alternative to the other Maximum Likelihood (ML) optimization-based architectures. We also show the need for supervised T-F mask estimation for effective noise suppression. Experimental results demonstrate that the proposed T-F mask-based SE significantly outperforms the recently proposed end-to-end SEGAN and a GAN-based Pix2Pix architecture. The performance evaluation in terms of both the predicted mask and the objective measures, dictates the improvement in the speech quality, while simultaneously reducing the speech distortion observed in the noisy mixture.