"Nonlinear Acoustic Echo Cancellation Based on a Sliding-Window Leaky Kernel Affine Projection Algorithm"
Jose Manuel Gil-Cacho, M. Signoretto, T. Waterschoot, M. Moonen, S. H. Jensen
IEEE Transactions on Audio, Speech, and Language Processing, September 2013. DOI: 10.1109/TASL.2013.2260742

Abstract: Acoustic echo cancellation (AEC) is used in speech communication systems where echoes degrade speech intelligibility. Standard approaches to AEC rely on the assumption that the echo path to be identified can be modeled by a linear filter. However, some elements introduce nonlinear distortion and must be modeled as nonlinear systems. Several nonlinear models have been used with varying success. The kernel affine projection algorithm (KAPA) has been successfully applied in many areas of signal processing, but not yet to nonlinear AEC (NLAEC). The contribution of this paper is threefold: (1) to apply KAPA to the NLAEC problem, (2) to develop a sliding-window leaky KAPA (SWL-KAPA) that is well suited to NLAEC applications, and (3) to propose a kernel function consisting of a weighted sum of a linear and a Gaussian kernel. In our experimental set-up, the proposed SWL-KAPA consistently outperforms the linear APA, yielding up to 12 dB of improvement in ERLE at a computational cost only 4.6 times higher. Moreover, it is shown that the SWL-KAPA outperforms, by 4-6 dB, a Volterra-based NLAEC, whose computational cost is 413 times that of the linear APA.

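The proposed kernel is stated concretely in the abstract: a weighted sum of a linear and a Gaussian kernel. A minimal sketch follows; the mixing weight `alpha` and Gaussian width `sigma` are illustrative defaults, not values from the paper.

```python
import numpy as np

def mixed_kernel(x, y, alpha=0.5, sigma=1.0):
    """Weighted sum of a linear and a Gaussian (RBF) kernel.

    alpha (mixing weight) and sigma (Gaussian width) are illustrative
    assumptions, not values taken from the paper.
    """
    linear = np.dot(x, y)                                        # linear kernel
    gaussian = np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))  # RBF kernel
    return alpha * linear + (1.0 - alpha) * gaussian
```

For alpha in [0, 1] this remains a valid positive-definite kernel, since nonnegative combinations of kernels are again kernels; the linear term lets the kernel machine model the (dominant) linear part of the echo path while the Gaussian term captures nonlinear distortion.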
"Sound Source Distance Estimation in Rooms based on Statistical Properties of Binaural Signals"
Eleftheria Georganti, T. May, S. Par, J. Mourjopoulos
IEEE Transactions on Audio, Speech, and Language Processing, August 2013. DOI: 10.1109/TASL.2013.2260155

Abstract: A novel method is proposed for estimating the distance of a sound source from binaural speech signals. The method relies on several statistical features extracted from such signals and their binaural cues. First, the standard deviation of the difference between the magnitude spectra of the left and right binaural signals is used as a feature. In addition, an extended set of statistical features that can improve distance detection is extracted from an auditory front-end that models the peripheral processing of the human auditory system. The method incorporates these features into two classification frameworks, based on Gaussian mixture models and support vector machines, and the relative merits of the two frameworks are evaluated. The proposed method achieves distance detection when tested in various acoustical environments and performs well in unknown environments. Its performance is also compared to that of an existing binaural distance detection method.

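The first feature described above, the standard deviation of the difference of the left/right magnitude spectra, can be sketched as follows. Single-frame FFT analysis and a dB scale are simplifying assumptions here; the paper's exact front-end may differ.

```python
import numpy as np

def spectral_deviation_feature(left, right, eps=1e-12):
    """Standard deviation of the difference between the left and right
    log-magnitude spectra of a binaural signal pair.

    A single-frame FFT and a dB scale are illustrative simplifications.
    """
    mag_l = 20.0 * np.log10(np.abs(np.fft.rfft(left)) + eps)   # left spectrum, dB
    mag_r = 20.0 * np.log10(np.abs(np.fft.rfft(right)) + eps)  # right spectrum, dB
    return np.std(mag_l - mag_r)
```

Intuitively, at larger source distances the reverberant field decorrelates the two ear signals, so their spectral difference fluctuates more across frequency and the feature grows.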
"Bayesian Feature Enhancement for Reverberation and Noise Robust Speech Recognition"
Volker Leutnant, A. Krueger, Reinhold Häb-Umbach
IEEE Transactions on Audio, Speech, and Language Processing, August 2013. DOI: 10.1109/TASL.2013.2258013

Abstract: In this contribution we extend a previously proposed Bayesian approach for the enhancement of reverberant logarithmic mel power spectral coefficients for robust automatic speech recognition to additionally compensate for background noise. A recently proposed observation model is employed, whose time-variant observation error statistics are obtained as a by-product of inferring the a posteriori probability density function of the clean speech feature vectors. Furthermore, a reduction of the computational effort and memory requirements is achieved by using a recursive formulation of the observation model. The performance of the proposed algorithms is first studied experimentally on a connected-digits recognition task with artificially created noisy reverberant data. It is shown that the time-variant observation error model leads to a significant error rate reduction at low signal-to-noise ratios compared to a time-invariant model. Further experiments were conducted on a 5000-word task recorded in a reverberant and noisy environment. A significant word error rate reduction was obtained, demonstrating the effectiveness of the approach on real-world data.

"A General Compression Approach to Multi-Channel Three-Dimensional Audio"
B. Cheng, C. Ritz, I. Burnett, Xiguang Zheng
IEEE Transactions on Audio, Speech, and Language Processing, August 2013. DOI: 10.1109/TASL.2013.2260156

Abstract: This paper presents a technique for low-bit-rate compression of three-dimensional (3D) audio produced by multiple loudspeaker channels. The approach is based on time-frequency analysis of the localization of spatial sound sources within the 3D space as rendered by a multi-channel audio signal (in this case, 16 channels). This analysis yields a stereo downmix signal representing the original 16 channels. Alternatively, a mono downmix signal with side information representing the location of sound sources within the 3D spatial scene can also be derived. The resulting downmix signals are then compressed with a traditional audio coder, yielding a representation of the 3D soundfield at bit rates comparable to those of existing stereo audio coders while maintaining the perceptual quality obtained from separate encoding of each channel.

"Sparse Classifier Fusion for Speaker Verification"
Ville Hautamäki, T. Kinnunen, Filip Sedlak, Kong-Aik Lee, B. Ma, Haizhou Li
IEEE Transactions on Audio, Speech, and Language Processing, August 2013. DOI: 10.1109/TASL.2013.2256895

Abstract: State-of-the-art speaker verification systems take advantage of a number of complementary base classifiers by fusing them to arrive at reliable verification decisions. In speaker verification, fusion is typically implemented as a weighted linear combination of the base classifier scores, with the combination weights estimated using a logistic regression model. An alternative is classifier ensemble selection, which can be seen as sparse regularization applied to logistic regression. Even though score fusion has been extensively studied in speaker verification, classifier ensemble selection has received much less attention. In this study, we extensively examine sparse classifier fusion on a collection of twelve I4U spectral subsystems on the NIST 2008 and 2010 speaker recognition evaluation (SRE) corpora.

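Sparse regularization applied to logistic-regression score fusion can be illustrated with an L1 penalty trained by proximal gradient steps, which drives the weights of uninformative base classifiers to exactly zero (implicitly deselecting them from the ensemble). The hyper-parameters and optimizer below are illustrative choices, not the study's actual setup.

```python
import numpy as np

def train_sparse_fusion(S, y, lam=0.01, lr=0.1, iters=500):
    """L1-regularized logistic-regression score fusion (proximal gradient).

    S: (n_trials, n_classifiers) matrix of base-classifier scores;
    y: labels in {0, 1} (1 = target trial). lam, lr, iters are
    illustrative hyper-parameters.
    """
    n, k = S.shape
    w = np.zeros(k)
    b = 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(S @ w + b)))   # sigmoid of fused score
        grad_w = S.T @ (p - y) / n               # logistic-loss gradient
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
        # soft-thresholding: the proximal step of the L1 penalty zeroes
        # out weights of base classifiers that contribute too little
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w, b
```

The fused score for a trial is then simply `S_row @ w + b`, the weighted linear combination described above.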
"Syntax-Based Translation With Bilingually Lexicalized Synchronous Tree Substitution Grammars"
Jiajun Zhang, Feifei Zhai, Chengqing Zong
IEEE Transactions on Audio, Speech, and Language Processing, August 2013. DOI: 10.1109/TASL.2013.2255283

Abstract: Syntax-based models can significantly improve translation performance owing to their grammatical modeling of one or both language sides. However, translation rules such as the non-lexical rule "VP→(x0x1, VP:x1 PP:x0)" in string-to-tree models do not consider any lexicalized information on the source or target side. Such a rule is so general that any subtree rooted at VP can substitute for the nonterminal VP:x1. Because rules containing nonterminals are frequently used when generating target-side tree structures, there is a risk that rules of this type will be severely misused in decoding for lack of lexicalization guidance. In this article, inspired by lexicalized PCFG, which is widely used in monolingual parsing, we propose to upgrade the STSG (synchronous tree substitution grammar)-based syntax translation model with a bilingually lexicalized STSG. Using the string-to-tree translation model as a case study, we present generative and discriminative models to integrate the lexicalized STSG into the translation model. Both small- and large-scale experiments on Chinese-to-English translation demonstrate that the proposed lexicalized STSG provides superior rule selection in decoding and substantially improves translation quality.

"A Class of Algorithms for Time-Frequency Multiplier Estimation"
Anaïk Olivero, B. Torrésani, R. Kronland-Martinet
IEEE Transactions on Audio, Speech, and Language Processing, August 2013. DOI: 10.1109/TASL.2013.2255274

Abstract: We propose a new approach, together with a corresponding class of algorithms, for offline estimation of linear operators mapping input to output signals. The operators are modeled as multipliers, i.e., linear operators that are diagonal in a frame or Bessel representation of the signals (e.g., Gabor, wavelets) and characterized by a transfer function. The estimation problem is formulated as a regularized inverse problem and solved using iterative algorithms based on gradient descent schemes. Several estimation problems, which differ in the choice of regularization function, are studied in the case of Gabor multipliers. The transfer function provides a meaningful interpretation of the differences between the two signals or signal classes under consideration, and examples are discussed. Furthermore, examples of signal transformations with such Gabor transfer functions are also given.

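For a diagonal (multiplier) model Y ≈ m·X acting coefficient-wise on a time-frequency representation, the simplest regularized inverse problem, a quadratic (Tikhonov/ridge) penalty, even has a closed-form solution per coefficient. This sketch illustrates only that baseline formulation; the paper studies more general regularizers solved with iterative gradient schemes.

```python
import numpy as np

def estimate_multiplier(X, Y, lam=1e-3):
    """Ridge-regularized estimate of a diagonal transfer function m
    such that Y ≈ m * X, coefficient by coefficient.

    X, Y: time-frequency (e.g., Gabor/STFT) coefficient arrays of the
    same shape; lam is an illustrative regularization weight.
    Minimizes |Y - m*X|^2 + lam*|m|^2 independently per coefficient.
    """
    return np.conj(X) * Y / (np.abs(X) ** 2 + lam)
```

The regularization keeps the estimate bounded where the input coefficients X are close to zero, which is exactly where the unregularized ratio Y/X would blow up.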
"The Spectral Nature of Maximum Likelihood Noise Compensated Linear Prediction"
L. Weruaga, L. Dimitrov
IEEE Transactions on Audio, Speech, and Language Processing, August 2013. DOI: 10.1109/TASL.2013.2255277

Abstract: Noise compensation in autoregressive (AR) analysis (or linear prediction), known as NCAR, has commonly been carried out in the time domain under the least-squares (LS) criterion. This paper studies the adequacy of that approach through a comparative analysis with selected frequency-based NCAR methods. In particular, maximizing the spectral likelihood (ML) results in a well-posed optimization problem that is easy to solve and brings useful insight into the rationale of the NCAR problem. In contrast, popular time-based NCAR methods are shown to be designed, in the ML context, around ill-conditioned criteria, requiring constraints to guarantee stable solutions. A statistical analysis on a realistic scenario as well as an experiment on speech enhancement complement this analysis.

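The time-domain compensation analyzed above can be illustrated in its textbook form: subtract the known additive-noise variance from the zero-lag autocorrelation before solving the Yule-Walker normal equations. This sketches the kind of LS-based baseline under discussion, not the paper's proposed spectral maximum-likelihood method.

```python
import numpy as np

def noise_compensated_lp(x, order, noise_var=0.0):
    """Linear-prediction coefficients from noise-compensated autocorrelation.

    A classic time-domain NCAR baseline (an illustrative assumption, not
    the paper's ML method): for white additive noise of known variance,
    only the lag-0 autocorrelation is biased, so subtract noise_var there
    before solving the normal equations.
    """
    n = len(x)
    r = np.array([np.dot(x[:n - k], x[k:]) / n for k in range(order + 1)])
    r[0] -= noise_var                       # compensate noise at lag 0
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])  # Yule-Walker equations
    return a
```

As the paper notes for this family of methods, too large a noise_var can make the compensated autocorrelation matrix indefinite, which is why stability constraints become necessary.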
"Broadband DOA Estimation Using Sensor Arrays on Complex-Shaped Rigid Bodies"
Dumidu S. Talagala, Wen Zhang, T. Abhayapala
IEEE Transactions on Audio, Speech, and Language Processing, August 2013. DOI: 10.1109/TASL.2013.2255282

Abstract: Sensor arrays mounted on complex-shaped rigid bodies are a common feature of many practical broadband direction-of-arrival (DOA) estimation applications. The scattering and reflections caused by these rigid bodies introduce complexity and diversity into the frequency domain of the channel transfer function, which presents several challenges to existing broadband DOA estimators. This paper presents a novel high-resolution broadband DOA estimation technique based on signal subspace decomposition. We describe how broadband signals can be decomposed into narrow subband components and combined such that the frequency-domain diversity is retained. The DOA estimation performance is compared with that of existing techniques using a uniform circular array and a sensor array on a hypothetical rigid body. An improvement in closely spaced source resolution of up to 6 dB is observed for the sensor array on the hypothetical rigid body, in comparison to the uniform circular array. The results suggest that the frequency-domain diversity introduced by complex-shaped rigid bodies can provide higher resolution and clearer separation of closely spaced broadband sound sources.

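The decompose-into-subbands-then-combine idea can be illustrated with a generic free-field example: narrowband MUSIC pseudo-spectra computed per subband and incoherently averaged across frequency. This is a standard subspace sketch for a uniform linear array, not the paper's rigid-body-aware estimator, which retains the transfer-function diversity rather than assuming free-field steering vectors.

```python
import numpy as np

def music_spectrum(Rxx, freq, d, c, angles, n_src=1):
    """Narrowband MUSIC pseudo-spectrum for a free-field uniform linear
    array (an illustrative stand-in for the paper's rigid-body model).

    Rxx: (M, M) sample covariance at subband centre frequency freq;
    d: sensor spacing in metres; c: propagation speed in m/s.
    """
    M = Rxx.shape[0]
    w, V = np.linalg.eigh(Rxx)
    En = V[:, :M - n_src]          # noise subspace (smallest eigenvalues)
    m = np.arange(M)
    spec = []
    for th in angles:
        a = np.exp(-2j * np.pi * freq * d * m * np.sin(th) / c)  # steering vector
        spec.append(1.0 / np.real(a.conj() @ En @ En.conj().T @ a))
    return np.array(spec)

def broadband_doa(covs, freqs, d, c, angles, n_src=1):
    """Incoherently average per-subband MUSIC spectra, one common way of
    combining the narrowband components of a broadband signal."""
    total = sum(music_spectrum(R, f, d, c, angles, n_src)
                for R, f in zip(covs, freqs))
    return angles[np.argmax(total)]
```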
"A Free-Source Method (FrSM) for Calibrating a Large-Aperture Microphone Array"
Sarthak Khanal, H. Silverman, Rahul R. Shakya
IEEE Transactions on Audio, Speech, and Language Processing, August 2013. DOI: 10.1109/TASL.2013.2256896

Abstract: Large-aperture microphone arrays can be used to capture and enhance speech from individual talkers in noisy, multi-talker, and reverberant environments. However, they must be calibrated, often more than once, to obtain accurate three-dimensional coordinates for all microphones. Direct-measurement techniques, such as a measuring tape or a laser-based tool, are cumbersome and time-consuming. Some previous methods that used acoustic signals for array calibration required bulky hardware and/or fixed, known source locations. Others, which allowed more flexible source placement, often have trouble with real data, have reported results in 2D only, or work only for small arrays. This paper describes a complete and robust method for automatic calibration using acoustic signals that is simple, repeatable, and accurate, and has been shown to work for a real system. The method requires only a single transducer (loudspeaker) with a microphone attached above its center. The unit is moved freely around the focal volume of the microphone array, generating a single long recording from all the microphones; after that, the system is completely automatic. We describe the free-source method (FrSM), validate its effectiveness, and present accuracy results against measured ground truth. The performance of FrSM is compared to that of several other methods for a real 128-microphone array.