{"title":"Reverberation and Noise Robust Feature Compensation Based on IMM","authors":"C. Han, S. Kang, N. Kim","doi":"10.1109/TASL.2013.2256893","DOIUrl":"https://doi.org/10.1109/TASL.2013.2256893","url":null,"abstract":"In this paper, we propose a novel feature compensation approach based on the interacting multiple model (IMM) algorithm specially designed for joint processing of background noise and acoustic reverberation. Our approach to cope with the time-varying environmental parameters is to establish a switching linear dynamic model for the additive and convolutive distortions, such as the background noise and acoustic reverberation, in the log-spectral domain. We construct multiple state space models with the speech corruption process in which the log spectra of clean speech and log frequency response of acoustic reverberation are jointly handled as the state of our interest. The proposed approach shows significant improvements in the Aurora-5 automatic speech recognition (ASR) task which was developed to investigate the influence on the performance of ASR for a hands-free speech input in noisy room environments.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2256893","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62888745","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MDCT Sinusoidal Analysis for Audio Signals Analysis and Processing","authors":"Shuhua Zhang, W. Dou, Huazhong Yang","doi":"10.1109/TASL.2013.2250963","DOIUrl":"https://doi.org/10.1109/TASL.2013.2250963","url":null,"abstract":"The Modified Discrete Cosine Transform (MDCT) is widely used in audio signals compression, but mostly limited to representing audio signals. This is because the MDCT is a real transform: Phase information is missing and spectral power varies frame to frame even for pure sine waves. We have a key observation concerning the structure of the MDCT spectrum of a sine wave: Across frames, the complete spectrum changes substantially, but if separated into even and odd subspectra, neither changes except scaling. Inspired by this observation, we find that the MDCT spectrum of a sine wave can be represented as an envelope factor times a phase-modulation factor. The first one is shift-invariant and depends only on the sine wave's amplitude and frequency, thus stays constant over frames. The second one has the form of sinθ for all odd bins and cosθ for all even bins, leading to subspectra's constant shapes. But this θ depends on the start point of a transform frame, therefore, changes at each new frame, and then changes the whole spectrum. We apply this formulation of the MDCT spectral structure to frequency estimation in the MDCT domain, both for pure sine waves and sine waves with noises. Compared to existing methods, ours are more accurate and more general (not limited to the sine window). We also apply the spectral structure to stereo coding. A pure tone or tone-dominant stereo signal may have very different left and right MDCT spectra, but their subspectra have similar shapes. One ratio for even bins and one ratio for odd bins will be enough to reconstruct the right from the left, saving half bitrate. This scheme is simple and at the same time more efficient than the traditional Intensity Stereo (IS).","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2250963","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62888243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Symmetric Kernel Partial Least Squares Framework for Speaker Recognition","authors":"Balaji Vasan Srinivasan, Yuancheng Luo, D. Garcia-Romero, D. Zotkin, R. Duraiswami","doi":"10.1109/TASL.2013.2253096","DOIUrl":"https://doi.org/10.1109/TASL.2013.2253096","url":null,"abstract":"I-vectors are concise representations of speaker characteristics. Recent progress in i-vectors related research has utilized their ability to capture speaker and channel variability to develop efficient automatic speaker verification (ASV) systems. Inter-speaker relationships in the i-vector space are non-linear. Accomplishing effective speaker verification requires a good modeling of these non-linearities and can be cast as a machine learning problem. Kernel partial least squares (KPLS) can be used for discriminative training in the i-vector space. However, this framework suffers from training data imbalance and asymmetric scoring. We use “one shot similarity scoring” (OSS) to address this. The resulting ASV system (OSS-KPLS) is tested across several conditions of the NIST SRE 2010 extended core data set and compared against state-of-the-art systems: Joint Factor Analysis (JFA), Probabilistic Linear Discriminant Analysis (PLDA), and Cosine Distance Scoring (CDS) classifiers. Improvements are shown.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2253096","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62888299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Model-Based Inversion of Dynamic Range Compression","authors":"Stanislaw Gorlow, J. Reiss","doi":"10.1109/TASL.2013.2253099","DOIUrl":"https://doi.org/10.1109/TASL.2013.2253099","url":null,"abstract":"In this work it is shown how a dynamic nonlinear time-variant operator, such as a dynamic range compressor, can be inverted using an explicit signal model. By knowing the model parameters that were used for compression one is able to recover the original uncompressed signal from a “broadcast” signal with high numerical accuracy and very low computational complexity. A compressor-decompressor scheme is worked out and described in detail. The approach is evaluated on real-world audio material with great success.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2253099","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62888373","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CLOSE—A Data-Driven Approach to Speech Separation","authors":"J. Ming, R. Srinivasan, D. Crookes, Ayeh Jafari","doi":"10.1109/TASL.2013.2250959","DOIUrl":"https://doi.org/10.1109/TASL.2013.2250959","url":null,"abstract":"This paper studies single-channel speech separation, assuming unknown, arbitrary temporal dynamics for the speech signals to be separated. A data-driven approach is described, which matches each mixed speech segment against a composite training segment to separate the underlying clean speech segments. To advance the separation accuracy, the new approach seeks and separates the longest mixed speech segments with matching composite training segments. Lengthening the mixed speech segments to match reduces the uncertainty of the constituent training segments, and hence the error of separation. For convenience, we call the new approach Composition of Longest Segments, or CLOSE. The CLOSE method includes a data-driven approach to model long-range temporal dynamics of speech signals, and a statistical approach to identify the longest mixed speech segments with matching composite training segments. Experiments are conducted on the Wall Street Journal database, for separating mixtures of two simultaneous large-vocabulary speech utterances spoken by two different speakers. The results are evaluated using various objective and subjective measures, including the challenge of large-vocabulary continuous speech recognition. It is shown that the new separation approach leads to significant improvement in all these measures.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2250959","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62888604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Perceptual Study on Velvet Noise and Its Variants at Different Pulse Densities","authors":"V. Välimäki, Heidi-Maria Lehtonen, M. Takanen","doi":"10.1109/TASL.2013.2255281","DOIUrl":"https://doi.org/10.1109/TASL.2013.2255281","url":null,"abstract":"This paper investigates sparse noise sequences, including the previously proposed velvet noise and its novel variants defined here. All sequences consist of sample values minus one, zero, and plus one only, and the location and the sign of each impulse is randomly chosen. Two of the proposed algorithms are direct variants of the original velvet noise requiring two random number sequences for determining the impulse locations and signs. In one of the proposed algorithms the impulse locations and signs are drawn from the same random number sequence, which is advantageous in terms of implementation. Moreover, two of the new sequences include known regions of zeros. The perceived smoothness of the proposed sequences was studied with a listening test in which test subjects compared the noise sequences against a reference signal that was a Gaussian white noise. The results show that the original velvet noise sounds smoother than the reference at 2000 impulses per second. At 4000 impulses per second, also three of the proposed algorithms are perceived smoother than the Gaussian noise sequence. These observations can be exploited in the synthesis of noisy sounds and in artificial reverberation.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2255281","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62888270","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Stochastic-Deterministic MMSE STFT Speech Enhancement With General A Priori Information","authors":"Matthew C. McCallum, B. Guillemin","doi":"10.1109/TASL.2013.2253100","DOIUrl":"https://doi.org/10.1109/TASL.2013.2253100","url":null,"abstract":"A wide range of Bayesian short-time spectral amplitude (STSA) speech enhancement algorithms exist, varying in both the statistical model used for speech and the cost functions considered. Current algorithms of this class consistently assume that the distribution of clean speech short time Fourier transform (STFT) samples are either randomly distributed with zero mean or deterministic. No single distribution function has been considered that captures both deterministic and random signal components. In this paper a Bayesian STSA algorithm is proposed under a stochastic-deterministic (SD) speech model that makes provision for the inclusion of a priori information by considering a non-zero mean. Analytical expressions are derived for the speech STFT magnitude in the MMSE sense, and phase in the maximum-likelihood sense. Furthermore, a practical method of estimating the a priori SD speech model parameters is described based on explicit consideration of harmonically related sinusoidal components in each STFT frame, and variations in both the magnitude and phase of these components between successive STFT frames. Objective tests using the PESQ measure indicate that the proposed algorithm results in superior speech quality when compared to several other speech enhancement algorithms. In particular it is clear that the proposed algorithm has an improved capability to retain low amplitude voiced speech components in low SNR conditions.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2253100","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62888454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Scaling Up Classification-Based Speech Separation","authors":"Yuxuan Wang, Deliang Wang","doi":"10.1109/TASL.2013.2250961","DOIUrl":"https://doi.org/10.1109/TASL.2013.2250961","url":null,"abstract":"Formulating speech separation as a binary classification problem has been shown to be effective. While good separation performance is achieved in matched test conditions using kernel support vector machines (SVMs), separation in unmatched conditions involving new speakers and environments remains a big challenge. A simple yet effective method to cope with the mismatch is to include many different acoustic conditions into the training set. However, large-scale training is almost intractable for kernel machines due to computational complexity. To enable training on relatively large datasets, we propose to learn more linearly separable and discriminative features from raw acoustic features and train linear SVMs, which are much easier and faster to train than kernel SVMs. For feature learning, we employ standard pre-trained deep neural networks (DNNs). The proposed DNN-SVM system is trained on a variety of acoustic conditions within a reasonable amount of time. Experiments on various test mixtures demonstrate good generalization to unseen speakers and background noises.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2250961","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62888647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sparse Reverberant Audio Source Separation via Reweighted Analysis","authors":"S. Arberet, P. Vandergheynst, R. Carrillo, J. Thiran, Y. Wiaux","doi":"10.1109/TASL.2013.2250962","DOIUrl":"https://doi.org/10.1109/TASL.2013.2250962","url":null,"abstract":"We propose a novel algorithm for source signals estimation from an underdetermined convolutive mixture assuming known mixing filters. Most of the state-of-the-art methods are dealing with anechoic or short reverberant mixture, assuming a synthesis sparse prior in the time-frequency domain and a narrowband approximation of the convolutive mixing process. In this paper, we address the source estimation of convolutive mixtures with a new algorithm based on i) an analysis sparse prior, ii) a reweighting scheme so as to increase the sparsity, iii) a wideband data-fidelity term in a constrained form. We show, through theoretical discussions and simulations, that this algorithm is particularly well suited for source separation of realistic reverberation mixtures. Particularly, the proposed algorithm outperforms state-of-the-art methods on reverberant mixtures of audio sources by more than 2 dB of signal-to-distortion ratio on the BSS Oracle dataset.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2250962","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62888684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Model-Based Unsupervised Spoken Term Detection with Spoken Queries","authors":"Chun-an Chan, Lin-Shan Lee","doi":"10.1109/TASL.2013.2248714","DOIUrl":"https://doi.org/10.1109/TASL.2013.2248714","url":null,"abstract":"We present a set of model-based approaches for unsupervised spoken term detection (STD) with spoken queries that requires neither speech recognition nor annotated data. This work shows the possibilities in migrating from DTW-based to model-based approaches for unsupervised STD. The proposed approach consists of three components: self-organizing models, query matching, and query modeling. To construct the self-organizing models, repeated patterns are captured and modeled using acoustic segment models (ASMs). In the query matching phase, a document state matching (DSM) approach is proposed to represent documents as ASM sequences, which are matched to the query frames. In this way, not only do the ASMs better model the signal distributions and time trajectories of speech, but the much-smaller number of states than frames for the documents leads to a much lower computational load. A novel duration-constrained Viterbi (DC-Vite) algorithm is further proposed for the above matching process to handle the speaking rate distortion problem. In the query modeling phase, a pseudo likelihood ratio (PLR) approach is proposed in the pseudo relevance feedback (PRF) framework. A likelihood ratio evaluated with query/anti-query HMMs trained with pseudo relevant/irrelevant examples is used to verify the detected spoken term hypotheses. The proposed framework demonstrates the usefulness of ASMs for STD in zero-resource settings and the potential of an instantly responding STD system using ASM indexing. The best performance is achieved by integrating DTW-based approaches into the rescoring steps in the proposed framework. Experimental results show an absolute 14.2% of mean average precision improvement with 77% CPU time reduction compared with the segmental DTW approach on a Mandarin broadcast news corpus. Consistent improvements were found on TIMIT and MediaEval 2011 Spoken Web Search corpus.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2248714","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62888356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}