{"title":"Perceptually-Motivated Nonlinear Channel Decorrelation for Stereo Acoustic Echo Cancellation","authors":"J. Valin","doi":"10.1109/HSCMA.2008.4538718","DOIUrl":"https://doi.org/10.1109/HSCMA.2008.4538718","url":null,"abstract":"Acoustic echo cancellation with stereo signals is generally an under-determined problem because of the high coherence between the left and right channels. In this paper, we present a novel method of significantly reducing inter-channel coherence without affecting the audio quality. Our work takes into account psychoacoustic masking and binaural auditory cues. The proposed non-linear processing combines a shaped comb-allpass (SCAL) filter with the injection of psychoacoustically masked noise. We show that the proposed method performs significantly better than other known methods for reducing inter-channel coherence.","PeriodicalId":129827,"journal":{"name":"2008 Hands-Free Speech Communication and Microphone Arrays","volume":"323 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132454527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HMM-Based Mask Estimation for a Speech Recognition Front-End Using Computational Auditory Scene Analysis","authors":"J. Park, J. Yoon, H. Kim","doi":"10.1093/ietisy/e91-d.9.2360","DOIUrl":"https://doi.org/10.1093/ietisy/e91-d.9.2360","url":null,"abstract":"In this paper, we propose a new mask estimation method for the computational auditory scene analysis (CASA) of speech using two microphones. The proposed method is based on a hidden Markov model (HMM) in order to incorporate an observation that the mask information should be correlated over contiguous analysis frames. In other words, HMM is used to estimate the mask information represented as the interaural time difference (ITD) and the interaural level difference (ILD) of two channel signals, and the estimated mask information is finally employed in the separation of desired speech from noisy speech. To show the effectiveness of the proposed mask estimation, we then compare the performance of the proposed method with that of a Gaussian kernel-based estimation method in terms of the performance of speech recognition. As a result, the proposed HMM-based mask estimation method provided an average word error rate reduction of 69.14% when compared with the Gaussian kernel-based mask estimation method.","PeriodicalId":129827,"journal":{"name":"2008 Hands-Free Speech Communication and Microphone Arrays","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114663957","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Beamforming with Optimized Interpolated Microphone Arrays","authors":"G. Doblinger","doi":"10.1109/HSCMA.2008.4538681","DOIUrl":"https://doi.org/10.1109/HSCMA.2008.4538681","url":null,"abstract":"We present an optimization procedure for wideband beam- forming with interpolated arrays. We intend to design a beam- former with a compact size. In addition, we want to reduce the number of sensors while maintaining a good beamforming performance. Our beamformers are implemented using FFT filterbanks. Performance is tested under far-field conditions and under sound propagation with simulated room impulse responses. In addition, we study the influence of sensor noise on the beamforming behavior.","PeriodicalId":129827,"journal":{"name":"2008 Hands-Free Speech Communication and Microphone Arrays","volume":"161 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124463462","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Receive Side Processing for Automotive Hands-Free Systems","authors":"B. Iser, G. Schmidt","doi":"10.1109/HSCMA.2008.4538730","DOIUrl":"https://doi.org/10.1109/HSCMA.2008.4538730","url":null,"abstract":"In the sending path of automotive hands-free systems several subunits - such as acoustic echo cancellation (AEC) and noise reduction (NR) - improve the quality of the outgoing signal. These units are usually realized in the frequency or subband domain in order to reduce the computational complexity. In the receiving path, however, only a few signal processing stages - such as bandwidth extension (BWE) [1] or gain adjustment - are realized in recent systems [2, 3]. These units are implemented in most cases in the time domain, since two analysis-synthesis schemes (one in the sending and one in the receiving path) would introduce more delay than allowed by ITU- or VDA-recommendations [4]. According to the best knowledge of the authors linking of conventional processing schemes in the sending path (AEC and NR) with those of the receiving path has not yet been addressed in research on hands-free systems. For the car environment some amplifier manufacturers perform a volume control in dependence of the driving speed of the car. Some have even the possibility of placing a microphone in the cabin for measuring the noise level within the car [2, 5]. But this does not apply to hands-free telephony. The estimated power spectral density (PSD) of the background noise (already estimated within the NR unit) can be used to adjust the BWE unit. Since in high noise conditions, artifacts introduced by a BWE scheme are less audible a stronger extension can be used compared to stand-still operation. Taking also the estimated echo spectrum into account (beside the noise PSD) an estimate for the SNR within the car cabin can be obtained. 
Using this estimate one could perform an automatic gain control of the receive signal for retaining a particular SNR within the car while the noise or the speaking level of the remote partner is changing. This can also be done in a frequency specific manner, resulting in a frequency selective adaptive equalization. No further microphone has to be placed in the cabin and the volume can be controlled independent of the amplifier using the resources (AEC, NR) already available.","PeriodicalId":129827,"journal":{"name":"2008 Hands-Free Speech Communication and Microphone Arrays","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130043092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Microphone Array Front-End Interface for Home Automation","authors":"G.E. Coelho, A. Serralheiro, J.P. Netti","doi":"10.1109/HSCMA.2008.4538717","DOIUrl":"https://doi.org/10.1109/HSCMA.2008.4538717","url":null,"abstract":"In this paper we present a microphone array (MA) interface to a Spoken Dialog System. Our goal is to create a hands- free home automation system with a vocal interface to control home devices. The user establishes a dialog with a virtual butler that is able to control a plethora of home devices, such as ceiling lights, air-conditioner, windows shades, hi-fi and TV features. A MA is used for the speech acquisition front-end. The multi-channel audio acquisition is pre-processed in real-time, performing speech enhancement with Delay-and-Sum Beamforming algorithm. The Direction of Arrival is estimated with the Generalized Cross Correlation with Phase Transform algorithm, enabling us to track the user. The enhanced speech signal is then processed in order to recognize orally issued commands that will control the house appliances. This paper describes the complete system emphasizing the MA and its implications on command recognition performance.","PeriodicalId":129827,"journal":{"name":"2008 Hands-Free Speech Communication and Microphone Arrays","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126227772","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speech Dereverberation in Short Time Fourier Transform Domain with Crossband Effect Compensation","authors":"T. Nakatania, T. Yoshiokaa, K. Kinoshita, M. Miyoshi, B. Juang","doi":"10.1109/HSCMA.2008.4538726","DOIUrl":"https://doi.org/10.1109/HSCMA.2008.4538726","url":null,"abstract":"It has recently been shown that the maximum likelihood estimation approach with a time-varying source model is very effective in achieving speech dereverberation based only on a short observation. In addition, STFT domain processing has been shown to be promising for implementing this dereverberation approach in a computationally efficient way. This paper presents a way of further improving the STFT domain speech dereverberation in terms of both computational cost and accuracy. One important issue here is how to calculate time-domain convolution with a long filter precisely using STFT. We introduce an STFT domain filtering method with crossband effect compensation for this purpose. Experimental results show that the proposed method allows us to implement the dereverberation algorithm in the STFT domain more precisely with less computational cost than the existing method.","PeriodicalId":129827,"journal":{"name":"2008 Hands-Free Speech Communication and Microphone Arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130499734","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the Use of Empirically Determined Impulse Responses for Improving Distant Talking Speech Recognition","authors":"T. Plotz, G. Fink","doi":"10.1109/HSCMA.2008.4538710","DOIUrl":"https://doi.org/10.1109/HSCMA.2008.4538710","url":null,"abstract":"Recognition rates of distant talking speech recognition applications substantially decrease if the acoustic environment contains reverberation. Although standard approaches for compensating such distortions, e.g. cepstral mean subtraction (CMS), are quite effective, they are not appropriate for dynamic human machine interaction. When only short portions of speech are uttered by speakers at different positions, compensation methods fail that require several seconds of speech. For this kind of applications we present a dereverberation approach utilizing empirically determined impulse responses. Prior to speaking users are asked to produce some impulse-like signal (clapping their hands, or snipping the fingers) which is used for compensation. By means of an experimental evaluation on the German Verbmobil corpus we demonstrate the promising potential of the approach.","PeriodicalId":129827,"journal":{"name":"2008 Hands-Free Speech Communication and Microphone Arrays","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132515129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Joint Particle Filter and Multi-Step Linear Prediction Framework to Provide Enhanced Speech Features Prior to Automatic Recognition","authors":"M. Wolfel","doi":"10.1109/HSCMA.2008.4538704","DOIUrl":"https://doi.org/10.1109/HSCMA.2008.4538704","url":null,"abstract":"Automatic speech recognition, which works well on recordings captured with mid- or far-field microphones, is essential for a natural verbal communication between humans and machines. While a great deal of research effort has addressed one of the two distortions frequently encountered in mid- and far-field sound capture, namely non-stationary noise and reverberation, much less work has undertaken to jointly combat both kinds of distortions. In our view, however, this joint approach is essential in order to further reduce catastrophic effects of noise and reverberation that are encountered as soon as the microphone is more than a few centimeters from the speaker's mouth. We propose here to integrate an estimate of the reverberation obtained by multi-step linear prediction into a particle filter framework that tracks and removes non-stationary additive distortions. Evaluations on actual recordings with different speaker to microphone distances demonstrate that techniques combating either non-stationary noise or reverberation can be combined for good effect.","PeriodicalId":129827,"journal":{"name":"2008 Hands-Free Speech Communication and Microphone Arrays","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131983026","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Maximum Likelihood Time Delay Estimation with Phase Domain Analysis in the Generalized Cross Correlation Framework","authors":"Bowon Lee, A. Said, T. Kalker, R. Schafer","doi":"10.1109/HSCMA.2008.4538695","DOIUrl":"https://doi.org/10.1109/HSCMA.2008.4538695","url":null,"abstract":"We propose a new method for efficiently estimating the maximum likelihood frequency weighting in the generalized cross correlation framework for time delay estimation. The estimation is based on the analysis of the cross spectrum between a pair of microphones. We model how phase distribution is affected by both noise and reverberation, and relax the common assumption that noise and reverberation are uncorrelated with the source. Thus, our method does not require knowledge of the noise spectrum or a detailed model of the reverberation. Experimental results show that the proposed method is superior to the PHAT method.","PeriodicalId":129827,"journal":{"name":"2008 Hands-Free Speech Communication and Microphone Arrays","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132084544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bridging the Gap: Towards a Unified Framework for Hands-Free Speech Recognition Using Microphone Arrays","authors":"Michael L. Seltzer","doi":"10.1109/HSCMA.2008.4538698","DOIUrl":"https://doi.org/10.1109/HSCMA.2008.4538698","url":null,"abstract":"In this paper we describe two families of algorithms for hands-free speech recognition using microphone arrays. Enhancement-based approaches use a cascade of independent processing blocks to perform speech enhancement followed by speech recognition. We discuss the reasons why this approach may be sub-optimal and motivate the need for a solution that tightly integrates all processing blocks into a common unified framework. This leads to a second family of algorithms called unified approaches which considers all processing stages to be components of a single system that operates with the common goal of improved recognition accuracy. We describe several examples of such algorithms that have been shown to outperform more traditional signal-processing-based approaches. In doing so, we hope to convey the benefits of performing hands-free speech recognition in this manner and motivate further research in this area.","PeriodicalId":129827,"journal":{"name":"2008 Hands-Free Speech Communication and Microphone Arrays","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127855043","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}