{"title":"Higher Order Cepstral Moment Normalization for Improved Robust Speech Recognition","authors":"C. Hsu, Lin-Shan Lee","doi":"10.1109/TASL.2008.2006575","DOIUrl":"https://doi.org/10.1109/TASL.2008.2006575","url":null,"abstract":"Cepstral normalization has widely been used as a powerful approach to produce robust features for speech recognition. Good examples of this approach include cepstral mean subtraction, and cepstral mean and variance normalization, in which either the first or both the first and the second moments of the Mel-frequency cepstral coefficients (MFCCs) are normalized. In this paper, we propose the family of higher order cepstral moment normalization, in which the MFCC parameters are normalized with respect to a few moments of orders higher than 1 or 2. The basic idea is that the higher order moments are more dominated by samples with larger values, which are very likely the primary sources of the asymmetry and abnormal flatness or tail size of the parameter distributions. Normalization with respect to these moments therefore puts more emphasis on these signal components and constrains the distributions to be more symmetric with more reasonable flatness and tail size. The fundamental principles behind this approach are also analyzed and discussed based on the statistical properties of the distributions of the MFCC parameters. Experimental results based on the AURORA 2, AURORA 3, AURORA 4, and Resource Management (RM) testing environments show that with the proposed approach, recognition accuracy can be significantly and consistently improved for all types of noise and all SNR conditions.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"1 1","pages":"205-220"},"PeriodicalIF":0.0,"publicationDate":"2009-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74259698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cascaded RLS-LMS Prediction in MPEG-4 Lossless Audio Coding","authors":"Haibin Huang, P. Fränti, Dong-Yan Huang, S. Rahardja","doi":"10.1109/TASL.2007.911675","DOIUrl":"https://doi.org/10.1109/TASL.2007.911675","url":null,"abstract":"This paper describes the cascaded recursive least square-least mean square (RLS-LMS) prediction, which is part of the recently published MPEG-4 Audio Lossless Coding international standard. The predictor consists of cascaded stages of simple linear predictors, with the prediction error at the output of one stage passed to the next stage as the input signal. A linear combiner adds up the intermediate estimates at the output of each prediction stage to give a final estimate of the RLS-LMS predictor. In the RLS-LMS predictor, the first prediction stage is a simple first-order predictor with a fixed coefficient value 1. The second prediction stage uses the recursive least square algorithm to adaptively update the predictor coefficients. The subsequent prediction stages use the normalized least mean square algorithm to update the predictor coefficients. The coefficients of the linear combiner are then updated using the sign-sign least mean square algorithm. For stereo audio signals, the RLS-LMS predictor uses both intrachannel prediction and interchannel prediction, which results in a 3% improvement in compression ratio over using only the intrachannel prediction. Through extensive tests, the MPEG-4 Audio Lossless coder using the RLS-LMS predictor has demonstrated a compression ratio that is on par with the best lossless audio coders in the field. In this paper, the structure of the RLS-LMS predictor is described in detail, and the optimal predictor configuration is studied through various experiments.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"32 1","pages":"554-562"},"PeriodicalIF":0.0,"publicationDate":"2008-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77222643","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comments on Vocal Tract Length Normalization Equals Linear Transformation in Cepstral Space","authors":"M. Afify, O. Siohan","doi":"10.1109/TASL.2007.896653","DOIUrl":"https://doi.org/10.1109/TASL.2007.896653","url":null,"abstract":"The bilinear transformation (BT) is used for vocal tract length normalization (VTLN) in speech recogniton systems. We prove two properties of the bilinear mapping that motivated the band-diagonal transform proposed in M. Afify and O. Siohan, (ldquoConstrained maximum likelihood linear regression for speaker adaptation,rdquo in Proc. ICSLP, Beijing, China, Oct. 2000.) This is in contrast to what is stated in M. Pitz and H. Ney, (ldquoVocal tract length normalization equals linear transformation in cepstral space,rdquo IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, pp 930-944, September 2005) that the transform of Afify and Siohan was motivated by empirical observations.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"23 1","pages":"1731-1732"},"PeriodicalIF":0.0,"publicationDate":"2007-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76538899","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generalized Lempel-Ziv Compression for Audio","authors":"D. Kirovski, Zeph Landau","doi":"10.1109/TASL.2006.881687","DOIUrl":"https://doi.org/10.1109/TASL.2006.881687","url":null,"abstract":"We introduce a novel compression paradigm to generalize a class of Lempel-Ziv algorithms for lossy compression of multimedia. Based upon the fact that music, in particular electronically generated sound, has substantial level of repetitiveness within a single clip, we generalize the basic Lempel-Ziv compression algorithm to support representing a single window of audio using a linear combination of filtered past windows. In this positioning paper, we present a detailed overview of the new lossy compression paradigm, we identify the basic challenges such as similarity search and present preliminary experimental results on a benchmark of electronically generated musical pieces","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"29 1","pages":"509-518"},"PeriodicalIF":0.0,"publicationDate":"2007-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79203937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multiple change-point audio segmentation and classification using an MDL-based Gaussian model","authors":"Chung-Hsien Wu, Chia-Hsin Hsieh","doi":"10.1109/TSA.2005.852988","DOIUrl":"https://doi.org/10.1109/TSA.2005.852988","url":null,"abstract":"This study presents an approach for segmenting and classifying an audio stream based on audio type. First, a silence deletion procedure is employed to remove silence segments in the audio stream. A minimum description length (MDL)-based Gaussian model is then proposed to statistically characterize the audio features. Audio segmentation segments the audio stream into a sequence of homogeneous subsegments using the MDL-based Gaussian model. A hierarchical threshold-based classifier is then used to classify each subsegment into different audio types. Finally, a heuristic method is adopted to smooth the subsegment sequence and provide the final segmentation and classification results. Experimental results indicate that for TDT-3 news broadcast, a missed detection rate (MDR) of 0.1 and a false alarm rate (FAR) of 0.14 were achieved for audio segmentation. Given the same MDR and FAR values, segment-based audio classification achieved a better classification accuracy of 88% compared to a clip-based approach.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"524 1","pages":"647-657"},"PeriodicalIF":0.0,"publicationDate":"2006-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77874467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Corrections to \"Automatic Transcription of Conversational Telephone Speech\"","authors":"T. Hain, P. Woodland, G. Evermann, M. Gales, Xunying Liu, G. Moore, Daniel Povey, Lan Wang","doi":"10.1109/TASL.2006.871051","DOIUrl":"https://doi.org/10.1109/TASL.2006.871051","url":null,"abstract":"Manuscript received December 9, 2003; August 9, 2004. This work was supported by GCHQ and by DARPA under Grant MDA972–02–0013. This paper does not necessarily reflect the position or the policy of the U.S. Government and no official endorsement should be inferred. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Geoffrey Zweig. The authors are with the Cambridge University Engineering Department, Cambridge CB2 1PZ, U.K. (e-mail: pcw@eng.cam.ac.uk). Digital Object Identifier 10.1109/TASL.2006.871051","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"52 1","pages":"727-727"},"PeriodicalIF":0.0,"publicationDate":"2006-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79824690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Corrections to \"Segmental minimum Bayes-risk decoding for automatic speech recognition\"","authors":"V. Goel, Shankar Kumar, W. Byrne","doi":"10.1109/TSA.2005.854087","DOIUrl":"https://doi.org/10.1109/TSA.2005.854087","url":null,"abstract":"The purpose of this paper is to correct and expand upon the experimental results presented in our recently published paper [1]. In [1, Sec. III-B], we present a risk-based lattice cutting (RLC) procedure to segment ASR word lattices into sequences of smaller sublattices. The purpose of this procedure is to restructure the original lattice to improve the efficiency of minimum Bayes-risk (MBR) and other lattice rescoring procedures. Given that the segmented lattices are to be rescored, it is crucial that no paths from the original lattice be lost in the segmentation process. In the experiments reported in our original publication, some of the original paths were inadvertently discarded from the segmented lattices. This affected the performance of the MBR results presented. In this paper, we briefly review the segmentation algorithm and explain the flaw in our previous experiments. We find consistent minor improvements in word error rate (WER) under the corrected procedure. More importantly, we report experiments confirming that the lattice segmentation procedure does indeed preserve all the paths in the original lattice.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"94 1","pages":"356-357"},"PeriodicalIF":0.0,"publicationDate":"2006-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79935234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Aggregate a posteriori linear regression adaptation","authors":"Jen-Tzung Chien, Chih-Hsien Huang","doi":"10.1109/TSA.2005.860847","DOIUrl":"https://doi.org/10.1109/TSA.2005.860847","url":null,"abstract":"We present a new discriminative linear regression adaptation algorithm for hidden Markov model (HMM) based speech recognition. The cluster-dependent regression matrices are estimated from speaker-specific adaptation data through maximizing the aggregate a posteriori probability, which can be expressed in a form of classification error function adopting the logarithm of posterior distribution as the discriminant function. Accordingly, the aggregate a posteriori linear regression (AAPLR) is developed for discriminative adaptation where the classification errors of adaptation data are minimized. Because the prior distribution of regression matrix is involved, AAPLR is geared with the Bayesian learning capability. We demonstrate that the difference between AAPLR discriminative adaptation and maximum a posteriori linear regression (MAPLR) adaptation is due to the treatment of the evidence. Different from minimum classification error linear regression (MCELR), AAPLR has closed-form solution to fulfil rapid adaptation. Experimental results reveal that AAPLR speaker adaptation does improve speech recognition performance with moderate computational cost compared to maximum likelihood linear regression (MLLR), MAPLR, MCELR and conditional maximum likelihood linear regression (CMLLR). These results are verified for supervised adaptation as well as unsupervised adaptation for different numbers of adaptation data.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"12 5 1","pages":"797-807"},"PeriodicalIF":0.0,"publicationDate":"2006-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78635503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Objective Assessment of Speech and Audio Quality - Technology and Applications","authors":"A. Rix, J. Beerends, Doh-Suk Kim, P. Kroon, O. Ghitza","doi":"10.1109/TASL.2006.883260","DOIUrl":"https://doi.org/10.1109/TASL.2006.883260","url":null,"abstract":"In the past few years, objective quality assessment models have become increasingly used for assessing or monitoring speech and audio quality. By measuring perceived quality on an easily-understood subjective scale, such as listening quality (excellent, good, fair, poor, bad), these methods provide a quick and repeatable way to estimate customer experience. Typical applications include audio quality evaluation, selection of codecs or other equipment, and measuring the quality of telephone networks. To introduce this special issue, this paper provides an overview of the field, outlining the main approaches to intrusive, nonintrusive and parametric models and discussing some of their limitations and areas of future work","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"36 1","pages":"1890-1901"},"PeriodicalIF":0.0,"publicationDate":"2006-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88067753","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Introduction to the Special Issue on Data Mining of Speech, Audio, and Dialog","authors":"M. Gilbert, Roger K. Moore, G. Zweig","doi":"10.1109/TSA.2005.852677","DOIUrl":"https://doi.org/10.1109/TSA.2005.852677","url":null,"abstract":"ATA mining is concerned with the science, technology, and engineering of discovering patterns and extracting potentially useful or interesting information automatically or semi-automatically from data. Data mining was introduced in the 1990s and has deep roots in the fields of statistics, artificial intelligence, and machine learning. With the advent of inexpensive storage space and faster processing over the past decade or so, data mining research has started to penetrate new grounds in areas of speech and audio processing as well as spoken language dialog. It has been fueled by the influx of audio data that are becoming more widely available from a variety of multimedia sources including webcasts, conversations, music, meetings, voice messages, lectures, television, and radio. Algorithmic advances in automatic speech recognition have also been a major, enabling technology behind the growth in data mining. Current state-of-the-art, large-vocabulary, continuous speech recognizers are now trained on a record amount of data—several hundreds of millions of words and thousands of hours of speech. Pioneering research in robust speech processing, large-scale discriminative training, finite state automata, and statistical hidden Markov modeling have resulted in real-time recognizers that are able to transcribe spontaneous speech with a word accuracy exceeding 85%. With this level of accuracy, the technology is now highly attractive for a variety of speech mining applications. Speech mining research includes many ways of applying machine learning, speech processing, and language processing algorithms to benefit and serve commercial applications. It also raises and addresses several new and interesting fundamental research challenges in the areas of prediction, search, explanation, learning, and language understanding. These basic challenges are becoming increasingly important in revolutionizing business processes by providing essential sales and marketing information about services, customers, and product offerings. They are also enabling a new class of learning systems to be created that can infer knowledge and trends automatically from data, analyze and report application performance, and adapt and improve over time with minimal or zero human involvement. Effective techniques for mining speech, audio, and dialog data can impact numerous business and government applications. The technology for monitoring conversational speech to discover patterns, capture useful trends, and generate alarms is essential for intelligence and law enforcement organizations as well as for enhancing call center operation. It is useful for an","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"70 1","pages":"633-634"},"PeriodicalIF":0.0,"publicationDate":"2005-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83915576","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}