{"title":"ANIQUE: An Auditory Model for Single-Ended Speech Quality Estimation","authors":"Doh-Suk Kim","doi":"10.1109/TSA.2005.851924","DOIUrl":"https://doi.org/10.1109/TSA.2005.851924","url":null,"abstract":"In predicting subjective quality of speech signal degraded by telecommunication networks, conventional objective models require a reference source speech signal, which is applied as an input to the network, as well as the degraded speech. Non-intrusive estimation of speech quality is a challenging problem in that only the degraded speech signal is available. Non-intrusive estimation can be used in many real applications when source speech signal is not available. In this paper, we propose a new approach for non-intrusive speech quality estimation utilizing the temporal envelope representation of speech. The proposed auditory non-intrusive quality estimation (ANIQUE) model is based on the functional roles of human auditory systems and the characteristics of human articulation systems. Experimental evaluations on 35 different tests demonstrated the effectiveness of the proposed model.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"153 1","pages":"821-831"},"PeriodicalIF":0.0,"publicationDate":"2005-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86039932","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Combination of autocorrelation-based features and projection measure technique for speaker identification","authors":"Kuo-Hwei Yuo, Tai-Hwei Hwang, Hsiao-Chuan Wang","doi":"10.1109/TSA.2005.848893","DOIUrl":"https://doi.org/10.1109/TSA.2005.848893","url":null,"abstract":"This paper presents a robust approach for speaker identification when the speech signal is corrupted by additive noise and channel distortion. Robust features are derived by assuming that the corrupting noise is stationary and the channel effect is fixed during an utterance. A two-step temporal filtering procedure on the autocorrelation sequence is proposed to minimize the effect of additive and convolutional noises. The first step applies a temporal filtering procedure in autocorrelation domain to remove the additive noise, and the second step is to perform the mean subtraction on the filtered autocorrelation sequence in logarithmic spectrum domain to remove the channel effect. No prior knowledge of noise characteristic is necessary. The additive noise can be a colored noise. Then the proposed robust feature is combined with the projection measure technique to gain further improvement in recognition accuracy. Experimental results show that the proposed method can significantly improve the performance of speaker identification task in noisy environment.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"34 1","pages":"565-574"},"PeriodicalIF":0.0,"publicationDate":"2005-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91163988","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rapid online adaptation based on transformation space model evolution","authors":"Dong Kook Kim, N. Kim","doi":"10.1109/TSA.2004.841427","DOIUrl":"https://doi.org/10.1109/TSA.2004.841427","url":null,"abstract":"This paper presents a new approach to online linear regression adaptation of continuous density hidden Markov models based on transformation space model (TSM) evolution. The TSM which characterizes the a priori knowledge of the training speakers associated with maximum likelihood linear regression matrix parameters is effectively described in terms of the latent variable models such as the factor analysis or probabilistic principal component analysis. The TSM provides various sources of information such as the correlation information, the prior distribution, and the prior knowledge of the regression parameters that are very useful for rapid adaptation. The quasi-Bayes estimation algorithm is formulated to incrementally update the hyperparameters of the TSM and regression matrices simultaneously. The proposed TSM evolution is a general framework with batch TSM adaptation as a special case. Experiments on supervised speaker adaptation demonstrate that the proposed approach is more effective compared with the conventional quasi-Bayes linear regression technique when a small amount of adaptation data is available.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"67 1","pages":"194-202"},"PeriodicalIF":0.0,"publicationDate":"2005-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83437925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Crosstalk resilient interference cancellation in microphone arrays using Capon beamforming","authors":"Wing-Kin Ma, P. Ching, B. Vo","doi":"10.1109/TSA.2004.833011","DOIUrl":"https://doi.org/10.1109/TSA.2004.833011","url":null,"abstract":"This paper studies a reference-assisted approach for interference canceling (IC) in microphone array systems. Conventionally, reference-assisted IC is based on the zero crosstalk assumption; i.e., when the desired source signal is absent in the reference microphones. In applications where crosstalk is inevitable, the conventional IC approach usually exhibits degraded performance due to cancellation of the desired signal. In this paper, we develop a crosstalk resilient IC method based on the Capon beamforming technique. The proposed beamformer deals with the uncertainty of crosstalk by applying a constraint on the worst-case crosstalk magnitude. The proposed beamformer not only performs IC, it also provides blind beamforming of the desired signal. We show that a blind beamformer based on the traditional minimum-mean-square-error (MMSE) IC method is a special case of the proposed beamformer. One key step of implementing the proposed Capon beamformer lies in solving a difficult nonconvex optimization problem, and we illustrate how the Capon optimal solution can be effectively approximated using the so-called semidefinite relaxation algorithm. Simulation results demonstrate that the proposed beamformer is more robust against crosstalk-induced signal cancellation than beamformers based on the MMSE-IC methods.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. 
Speech Audio Process.","volume":"2 1","pages":"468-477"},"PeriodicalIF":0.0,"publicationDate":"2004-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76642765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Introduction to the Special Issue on Multichannel Signal Processing for Audio and Acoustics Applications","authors":"Walter Kellermann, M. Sondhi, D. DeVries","doi":"10.1109/TSA.2004.833716","DOIUrl":"https://doi.org/10.1109/TSA.2004.833716","url":null,"abstract":"The IEEE Signal Processing Society has its roots in an area where acoustics, speech, and signal processing converge, as was reflected in the former name of the society when it was founded in 1974. The interface between acoustics, speech, and signal processing is still an area of great interest to the society, with many fundamental problems still unsolved. Research is driven by applications where acoustic signals have to be captured, transmitted, and/or reproduced in an acoustic environment that includes echoes, noise, and reverberation. Considering human/machine interfaces as a major area of applications, it is obvious that signal processing becomes more challenging as the distance between humans and the machines increases, as the signal bandwidth increases, and as the acoustic environment becomes more complex and hostile. Increasingly sophisticated algorithms have been developed since the mid-1970s and along with the availability of greatly increased and affordable computational power, multichannel signal processing algorithms naturally evolved for exploiting the spatial dimension of acoustic signals. The importance and popularity of this field was well reflected by the large number of submissions to this special issue. The volume of high-quality papers could not be fitted into the page budget allotted to us. Thus, we regrettably had to decide to publish some of them in a second section of this special issue as part of a regular issue of the TRANSACTIONS in early 2005. For sound reproduction, where we want to provide a pair of desired signals at the listeners’ ear drums, seamless human/machine interfaces based on multichannel techniques have been implemented since the invention of stereo systems. 
However, providing the true spatial sound experience in large listening spaces became possible only with new multichannel signal processing techniques, such as wavefield synthesis. Still, major challenges remain, especially phase-true equalization of listening room acoustics and the cancellation of local noise sources and interferers. On the other hand, acquisition of audio and speech signals has been a research topic since the invention of the microphone and still today presents major challenges for the signal processing community. Structurally the simplest problem, the acoustic feedback from loudspeakers into microphones is addressed by acoustic echo cancellation: From the single-channel case which has been investigated since the 1970s, research has moved on to stereo and multichannel reproduction, recently culminating in a new wave-domain adaptive filtering concept which has been presented for the first time at ICASSP 2004. For removing unwanted interference and noise from desired signals, multichannel techniques utilize spatial diversity to discriminate between desired and undesired components, either by exploiting different spatial coherence properties or by beamforming, which directs a beam of increased sensitivity towards the desired source. For tr","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"11 1","pages":"449-450"},"PeriodicalIF":0.0,"publicationDate":"2004-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87052996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Introduction to the Special Issue on Spontaneous Speech Processing","authors":"S. Furui, M. Beckman, Julia Hirschberg, S. Itahashi, Tatsuya Kawahara, Satoshi Nakamura, Shrikanth S. Narayanan","doi":"10.1109/TSA.2004.828628","DOIUrl":"https://doi.org/10.1109/TSA.2004.828628","url":null,"abstract":"","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"5 1","pages":"349-350"},"PeriodicalIF":0.0,"publicationDate":"2004-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72895375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Source localization in reverberant environments: modeling and statistical analysis","authors":"T. Gustafsson, B. Rao, M. Trivedi","doi":"10.1109/TSA.2003.818027","DOIUrl":"https://doi.org/10.1109/TSA.2003.818027","url":null,"abstract":"Room reverberation is typically the main obstacle for designing robust microphone-based source localization systems. The purpose of the paper is to analyze the achievable performance of acoustical source localization methods when room reverberation is present. To facilitate the analysis, we apply well known results from room acoustics to develop a simple but useful statistical model for the room transfer function. The properties of the statistical model are found to correlate well with results from real data measurements. The room transfer function model is further applied to analyze the statistical properties of some existing methods for source localization. In this respect we consider especially the asymptotic error variance and the probability of an anomalous estimate. A noteworthy outcome of the analysis is that the so-called PHAT time-delay estimator is shown to be optimal among a class of cross-correlation based time-delay estimators. To verify our results on the error variance and the outlier probability we apply the image method for simulation of the room transfer function.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"144 1","pages":"791-803"},"PeriodicalIF":0.0,"publicationDate":"2003-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73441635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robust time delay estimation exploiting redundancy among multiple microphones","authors":"Jingdong Chen, J. Benesty, Yiteng Huang","doi":"10.1109/TSA.2003.818025","DOIUrl":"https://doi.org/10.1109/TSA.2003.818025","url":null,"abstract":"To find the position of an acoustic source in a room, typically, a set of relative delays among different microphone pairs needs to be determined. The generalized cross-correlation (GCC) method is the most popular to do so and is well explained in a landmark paper by Knapp and Carter. In this paper, the idea of cross-correlation coefficient between two random signals is generalized to the multichannel case by using the notion of spatial prediction. The multichannel spatial correlation matrix is then deduced and its properties are discussed. We then propose a new method based on the multichannel spatial correlation matrix for time delay estimation. It is shown that this new approach can take advantage of the redundancy when more than two microphones are available and this redundancy can help the estimator to better cope with noise and reverberation.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"1 1","pages":"549-557"},"PeriodicalIF":0.0,"publicationDate":"2003-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88299346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robust recognition of children's speech","authors":"A. Potamianos, Shrikanth S. Narayanan","doi":"10.1109/TSA.2003.818026","DOIUrl":"https://doi.org/10.1109/TSA.2003.818026","url":null,"abstract":"Developmental changes in speech production introduce age-dependent spectral and temporal variability in the speech signal produced by children. Such variabilities pose challenges for robust automatic recognition of children's speech. Through an analysis of age-related acoustic characteristics of children's speech in the context of automatic speech recognition (ASR), effects such as frequency scaling of spectral envelope parameters are demonstrated. Recognition experiments using acoustic models trained from adult speech and tested against speech from children of various ages clearly show performance degradation with decreasing age. On average, the word error rates are two to five times worse for children's speech than for adult speech. Various techniques for improving ASR performance on children's speech are reported. A speaker normalization algorithm that combines frequency warping and model transformation is shown to reduce acoustic variability and significantly improve ASR performance for child speakers (by 25-45% under various model training and testing conditions). The use of age-dependent acoustic models further reduces word error rate by 10%. The potential of using piece-wise linear and phoneme-dependent frequency warping algorithms for reducing the variability in the acoustic feature space of children is also investigated.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. 
Speech Audio Process.","volume":"7 1","pages":"603-616"},"PeriodicalIF":0.0,"publicationDate":"2003-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76983919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}