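Two-stage audio-visual speech dereverberation and separation based on models of the interaural spatial cues and spatial covariance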

2013 18th International Conference on Digital Signal Processing (DSP) | Pub Date: 2013-07-01 | DOI: 10.1109/ICDSP.2013.6622780

Muhammad Salman Khan, S. M. Naqvi, J. Chambers


Citations: 7

Abstract

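This work presents a two-stage speech source separation algorithm based on combined models of interaural cues and spatial covariance, which use knowledge of the source locations estimated from video. In the first, pre-processing stage, the late reverberant speech components are suppressed by a spectral subtraction rule to dereverberate the observed mixture. In the second stage, the binaural spatial parameters (the interaural phase difference and the interaural level difference) and the spatial covariance are modeled in the short-time Fourier transform (STFT) domain to assign individual time-frequency (TF) units to each source. The parameters of these probabilistic models and the TF regions assigned to each source are updated with the expectation-maximization (EM) algorithm. The algorithm generates TF masks that are used to reconstruct the individual speech sources. Objective results, in terms of the signal-to-distortion ratio (SDR) and the perceptual evaluation of speech quality (PESQ), confirm that the proposed multimodal method with pre-processing is a promising approach for source separation in highly reverberant rooms.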
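The two binaural cues named in the abstract, the interaural level difference (ILD) and the interaural phase difference (IPD), can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `stft` and `interaural_cues`, the Hann window, and the frame length and hop size are all assumptions chosen for illustration, and the EM-based probabilistic modeling, spatial covariance model, and spectral subtraction stage are not shown.

```python
import numpy as np


def stft(x, frame_len=512, hop=128):
    """Hann-windowed STFT (hypothetical helper).

    Returns a complex array of shape (n_frames, frame_len // 2 + 1).
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def interaural_cues(left, right, eps=1e-12):
    """Per time-frequency unit ILD (in dB) and IPD (in radians).

    Assumed convention: both cues are measured as left relative to right.
    """
    spec_l = stft(left)
    spec_r = stft(right)
    # Level difference: ratio of magnitudes, expressed in decibels.
    ild_db = 20.0 * np.log10((np.abs(spec_l) + eps) / (np.abs(spec_r) + eps))
    # Phase difference: angle of the cross-spectrum, wrapped to (-pi, pi].
    ipd_rad = np.angle(spec_l * np.conj(spec_r))
    return ild_db, ipd_rad


# Toy binaural pair: the right channel is the left channel attenuated by half.
rng = np.random.default_rng(0)
left = rng.standard_normal(4096)
right = 0.5 * left
ild_db, ipd_rad = interaural_cues(left, right)
```

In this toy case the ILD is about 6 dB in every TF unit and the IPD is zero, since the channels differ only by a gain. In a method like the one described, such per-unit cues would be the observations that the probabilistic models score for each source.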



