{"title":"Cross Pattern Coherence Algorithm for Spatial Filtering Applications Utilizing Microphone Arrays","authors":"Symeon Delikaris-Manias, V. Pulkki","doi":"10.1109/TASL.2013.2277928","DOIUrl":"https://doi.org/10.1109/TASL.2013.2277928","url":null,"abstract":"A parametric spatial filtering algorithm with a fixed beam direction is proposed in this paper. The algorithm utilizes the normalized cross-spectral density between signals from microphones of different orders as a criterion for focusing in specific directions. The correlation between microphone signals is estimated in the time-frequency domain. A post-filter is calculated from a multichannel input and is used to assign attenuation values to a coincidentally captured audio signal. The proposed algorithm is simple to implement and offers the capability of coping with interfering sources at different azimuthal locations with or without the presence of diffuse sound. It is implemented by using directional microphones placed in the same look direction and have the same magnitude and phase response. Experiments are conducted with simulated and real microphone arrays employing the proposed post-filter and compared to previous coherence-based approaches, such as the McCowan post-filter. A significant improvement is demonstrated in terms of objective quality measures. Formal listening tests conducted to assess the audibility of artifacts of the proposed algorithm in real acoustical scenarios show that no annoying artifacts existed with certain spectral floor values. 
Examples of the proposed algorithm can be found online at http://www.acoustics.hut.fi/projects/cropac/soundExamples.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":"21 1","pages":"2356-2367"},"PeriodicalIF":0.0,"publicationDate":"2013-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2277928","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62891892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Passive Temporal Offset Estimation of Multichannel Recordings of an Ad-Hoc Microphone Array","authors":"Pasi Pertilä, M. Hämäläinen, Mikael Mieskolainen","doi":"10.1109/TASLP.2013.2286921","DOIUrl":"https://doi.org/10.1109/TASLP.2013.2286921","url":null,"abstract":"In recent years ad-hoc microphone arrays have become ubiquitous, and the capture hardware and quality is increasingly more sophisticated. Ad-hoc arrays hold a vast potential for audio applications, but they are inherently asynchronous, i.e., temporal offset exists in each channel, and furthermore the device locations are generally unknown. Therefore, the data is not directly suitable for traditional microphone array applications such as source localization and beamforming. This work presents a least squares method for temporal offset estimation of a static ad-hoc microphone array. The method utilizes the captured audio content without the need to emit calibration signals, provided that during the recording a sufficient amount of sound sources surround the array. The Cramer-Rao lower bound of the estimator is given and the effect of limited number of surrounding sources on the solution accuracy is investigated. A practical implementation is then presented using non-linear filtering with automatic parameter adjustment. Simulations over a range of reverberation and noise levels demonstrate the algorithm's robustness. 
Using smartphones an average RMS error of 3.5 samples (at 48 kHz) was reached when the algorithm's assumptions were met.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":"21 1","pages":"2393-2402"},"PeriodicalIF":0.0,"publicationDate":"2013-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASLP.2013.2286921","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62892231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Second Order Methods for Optimizing Convex Matrix Functions and Sparse Covariance Clustering","authors":"Gillian M. Chin, J. Nocedal, P. Olsen, Steven J. Rennie","doi":"10.1109/TASL.2013.2263142","DOIUrl":"https://doi.org/10.1109/TASL.2013.2263142","url":null,"abstract":"A variety of first-order methods have recently been proposed for solving matrix optimization problems arising in machine learning. The premise for utilizing such algorithms is that second order information is too expensive to employ, and so simple first-order iterations are likely to be optimal. In this paper, we argue that second-order information is in fact efficiently accessible in many matrix optimization problems, and can be effectively incorporated into optimization algorithms. We begin by reviewing how certain Hessian operations can be conveniently represented in a wide class of matrix optimization problems, and provide the first proofs for these results. Next we consider a concrete problem, namely the minimization of the ℓ1 regularized Jeffreys divergence, and derive formulae for computing Hessians and Hessian vector products. This allows us to propose various second order methods for solving the Jeffreys divergence problem. We present extensive numerical results illustrating the behavior of the algorithms and apply the methods to a speech recognition problem. We compress full covariance Gaussian mixture models utilized for acoustic models in automatic speech recognition. 
By discovering clusters of (sparse inverse) covariance matrices, we can compress the number of covariance parameters by a factor exceeding 200, while still outperforming the word error rate (WER) performance of a diagonal covariance model that has 20 times less covariance parameters than the original acoustic model.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":"123 1","pages":"2244-2254"},"PeriodicalIF":0.0,"publicationDate":"2013-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2263142","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62889848","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distributional Semantic Models for Affective Text Analysis","authors":"Nikos Malandrakis, A. Potamianos, Elias Iosif, Shrikanth S. Narayanan","doi":"10.1109/TASL.2013.2277931","DOIUrl":"https://doi.org/10.1109/TASL.2013.2277931","url":null,"abstract":"We present an affective text analysis model that can directly estimate and combine affective ratings of multi-word terms, with application to the problem of sentence polarity/semantic orientation detection. Starting from a hierarchical compositional method for generating sentence ratings, we expand the model by adding multi-word terms that can capture non-compositional semantics. The method operates similarly to a bigram language model, using bigram terms or backing off to unigrams based on a (degree of) compositionality criterion. The affective ratings for n-gram terms of different orders are estimated via a corpus-based method using distributional semantic similarity metrics between unseen words and a set of seed words. N-gram ratings are then combined into sentence ratings via simple algebraic formulas. The proposed framework produces state-of-the-art results for word-level tasks in English and German and the sentence-level news headlines classification SemEval'07-Task14 task. 
The inclusion of bigram terms to the model provides significant performance improvement, even if no term selection is applied.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":"21 1","pages":"2379-2392"},"PeriodicalIF":0.0,"publicationDate":"2013-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2277931","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62891780","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Introduction to the Special Section on Large-Scale Optimization for Audio, Speech, and Language Processing","authors":"D. Kanevsky, Xiaodong He, G. Heigold, Haizhou Li, Stephen J. Wright","doi":"10.1109/TASL.2013.2283631","DOIUrl":"https://doi.org/10.1109/TASL.2013.2283631","url":null,"abstract":"The six papers in this special section on large-scale optimization for Audio, Speech, and Language Processing are summarized here.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":"21 1","pages":"2229-2230"},"PeriodicalIF":0.0,"publicationDate":"2013-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2283631","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62892246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimization Techniques to Improve Training Speed of Deep Neural Networks for Large Speech Tasks","authors":"Tara N. Sainath, Brian Kingsbury, H. Soltau, B. Ramabhadran","doi":"10.1109/TASL.2013.2284378","DOIUrl":"https://doi.org/10.1109/TASL.2013.2284378","url":null,"abstract":"While Deep Neural Networks (DNNs) have achieved tremendous success for large vocabulary continuous speech recognition (LVCSR) tasks, training these networks is slow. Even to date, the most common approach to train DNNs is via stochastic gradient descent, serially on one machine. Serial training, coupled with the large number of training parameters (i.e., 10-50 million) and speech data set sizes (i.e., 20-100 million training points) makes DNN training very slow for LVCSR tasks. In this work, we explore a variety of different optimization techniques to improve DNN training speed. This includes parallelization of the gradient computation during cross-entropy and sequence training, as well as reducing the number of parameters in the network using a low-rank matrix factorization. Applying the proposed optimization techniques, we show that DNN training can be sped up by a factor of 3 on a 50-hour English Broadcast News (BN) task with no loss in accuracy. 
Furthermore, using the proposed techniques, we are able to train DNNs on a 300-hr Switchboard (SWB) task and a 400-hr English BN task, showing improvements between 9-30% relative over a state-of-the art GMM/HMM system while the number of parameters of the DNN is smaller than the GMM/HMM system.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":"21 1","pages":"2267-2276"},"PeriodicalIF":0.0,"publicationDate":"2013-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2284378","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62892587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Difference of Convex Functions Approach to Large-Scale Log-Linear Model Estimation","authors":"Theodoros Tsiligkaridis, E. Marcheret, V. Goel","doi":"10.1109/TASL.2013.2271592","DOIUrl":"https://doi.org/10.1109/TASL.2013.2271592","url":null,"abstract":"We introduce a new class of parameter estimation methods for log-linear models. Our approach relies on the fact that minimizing a rational function of mixtures of exponentials is equivalent to minimizing a difference of convex functions. This allows us to construct convex auxiliary functions by applying the concave-convex procedure (CCCP). We consider a modification of CCCP where a proximal term is added (ProxCCCP), and extend it further by introducing an ℓ1 penalty. For solving the ` convex + ℓ1' auxiliary problem, we propose an approach called SeqGPSR that is based on sequential application of the GPSR procedure. We present convergence analysis of the algorithms, including sufficient conditions for convergence to a critical point of the objective function. We propose an adaptive procedure for varying the strength of the proximal regularization term in each ProxCCCP iteration, and show this procedure (AProxCCCP) is effective in practice and stable under some mild conditions. The CCCP procedure and proposed variants are applied to the task of optimizing the cross-entropy objective function for an audio frame classification problem. Class posteriors are modeled using log-linear models consisting of approximately 6 million parameters. 
Our results show that CCCP variants achieve a much better cross-entropy objective value as compared to direct optimization of the objective function by a first order gradient based approach, stochastic gradient descent or the L-BFGS procedure.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":"21 1","pages":"2255-2266"},"PeriodicalIF":0.0,"publicationDate":"2013-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2271592","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62891324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Room Reverberation Reconstruction: Interpolation of the Early Part Using Compressed Sensing","authors":"R. Mignot, L. Daudet, F. Ollivier","doi":"10.1109/TASL.2013.2273662","DOIUrl":"https://doi.org/10.1109/TASL.2013.2273662","url":null,"abstract":"This paper deals with the interpolation of the Room Impulse Responses (RIRs) within a whole volume, from as few measurements as possible, and without the knowledge of the geometry of the room. We focus on the early reflections of the RIRs, that have the key property of being sparse in the time domain: this can be exploited in a framework of model-based Compressed Sensing. Starting from a set of RIRs randomly sampled in the spatial domain of interest by a 3D microphone array, we propose a modified Matching Pursuit algorithm to estimate the position of a small set of virtual sources. Then, the reconstruction of the RIRs at interpolated positions is performed using a projection onto a basis of monopoles, which correspond to the estimated virtual sources. An extension of the proposed algorithm allows the interpolation of the positions of both source and receiver, using the acquisition of four different source positions. This approach is validated both by numerical examples, and by experimental measurements using a 3D array with up to 120 microphones.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":"21 1","pages":"2301-2312"},"PeriodicalIF":0.0,"publicationDate":"2013-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2273662","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62891446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimization Algorithms and Applications for Speech and Language Processing","authors":"Stephen J. Wright, D. Kanevsky, L. Deng, Xiaodong He, G. Heigold, Haizhou Li","doi":"10.1109/TASL.2013.2283777","DOIUrl":"https://doi.org/10.1109/TASL.2013.2283777","url":null,"abstract":"Optimization techniques have been used for many years in the formulation and solution of computational problems arising in speech and language processing. Such techniques are found in the Baum-Welch, extended Baum-Welch (EBW), Rprop, and GIS algorithms, for example. Additionally, the use of regularization terms has been seen in other applications of sparse optimization. This paper outlines a range of problems in which optimization formulations and algorithms play a role, giving some additional details on certain application problems in machine translation, speaker/language recognition, and automatic speech recognition. Several approaches developed in the speech and language processing communities are described in a way that makes them more recognizable as optimization procedures. Our survey is not exhaustive and is complemented by other papers in this volume.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":"21 1","pages":"2231-2243"},"PeriodicalIF":0.0,"publicationDate":"2013-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2283777","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62892454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Large Vocabulary Speech Recognition on Parallel Architectures","authors":"P. Cardinal, P. Dumouchel, Gilles Boulianne","doi":"10.1109/TASL.2013.2271591","DOIUrl":"https://doi.org/10.1109/TASL.2013.2271591","url":null,"abstract":"The speed of modern processors has remained constant over the last few years but the integration capacity continues to follow Moore's law and thus, to be scalable, applications must be parallelized. The parallelization of the classical Viterbi beam search has been shown to be very difficult on multi-core processor architectures or massively threaded architectures such as Graphics Processing Unit (GPU). The problem with this approach is that active states are scattered in memory and thus, they cannot be efficiently transferred to the processor memory. This problem can be circumvented by using the A* search which uses a heuristic to significantly reduce the number of explored hypotheses. The main advantage of this algorithm is that the processing time is moved from the search in the recognition network to the computation of heuristic costs, which can be designed to take advantage of parallel architectures. 
Our parallel implementation of the A* decoder on a 4-core processor with a GPU led to a speed-up factor of 6.13 compared to the Viterbi beam search at its maximum capacity and an improvement of 4% absolute in accuracy at real-time.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":"21 1","pages":"2290-2300"},"PeriodicalIF":0.0,"publicationDate":"2013-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2271591","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62891565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}