{"title":"Prosodic and accentual information for automatic speech recognition","authors":"Diego H. Milone, A. Rubio","doi":"10.1109/TSA.2003.814368","DOIUrl":"https://doi.org/10.1109/TSA.2003.814368","url":null,"abstract":"Various aspects relating to the human production and perception of speech have gradually been incorporated into automatic speech recognition systems. Nevertheless, the set of speech prosodic features has not yet been used in an explicit way in the recognition process itself. This study presents an analysis of prosody's three most important parameters, namely energy, fundamental frequency and duration, together with a method for incorporating this information into automatic speech recognition. On the basis of a preliminary analysis, a design is proposed for a prosodic feature classifier in which these parameters are associated with orthographic accentuation. Prosodic-accentual features are incorporated in a hidden Markov model recognizer; their theoretical formulation and experimental setup are then presented. Several experiments were conducted to show how the method performs with a Spanish continuous-speech database. Using this approach to process other database subsets, we obtained a word recognition error reduction rate of 28.91%.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"2016 1","pages":"321-333"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86125549","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance limits in subband beamforming","authors":"S. Nordholm, I. Claesson, N. Grbic","doi":"10.1109/TSA.2003.811543","DOIUrl":"https://doi.org/10.1109/TSA.2003.811543","url":null,"abstract":"This paper analyzes subband beamforming schemes mainly aimed at speech enhancement and acoustic echo suppression applications such as hands-free telephony for both mobile and office environments, Internet telephony and video conferencing. Analytical descriptions of both causal finite-length and noncausal infinite-length subband microphone array structures are given. More specifically, this paper compares finite Wiener filter performance with the noncausal Wiener solution, giving a comprehensive theoretical suppression limit. It is shown that even short filters will yield a good approximation of the infinite solution, provided that the element spacing and temporal sampling is matched to the frequency band of interest. Typically, 10-20 FIR taps are sufficient in each subband.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"20 1","pages":"193-203"},"PeriodicalIF":0.0,"publicationDate":"2003-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80592352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generalized digital waveguide networks","authors":"D. Rocchesso, J. Smith","doi":"10.1109/TSA.2003.811541","DOIUrl":"https://doi.org/10.1109/TSA.2003.811541","url":null,"abstract":"Digital waveguides are generalized to the multivariable case with the goal of maximizing generality while retaining robust numerical properties and simplicity of realization. Multivariable complex power is defined, and conditions for \"medium passivity\" are presented. Multivariable complex wave impedances, such as those deriving from multivariable lossy waveguides, are used to construct scattering junctions which yield frequency dependent scattering coefficients which can be implemented in practice using digital filters. The general form for the scattering matrix at a junction of multivariable waveguides is derived. An efficient class of loss-modeling filters is derived, including a rule for checking validity of the small-loss assumption. An example application in musical acoustics is given.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"66 1","pages":"242-254"},"PeriodicalIF":0.0,"publicationDate":"2003-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73738358","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Finite difference schemes and digital waveguide networks for the wave equation: stability, passivity, and numerical dispersion","authors":"S. Bilbao, J. Smith","doi":"10.1109/TSA.2003.811535","DOIUrl":"https://doi.org/10.1109/TSA.2003.811535","url":null,"abstract":"In this paper, some simple families of explicit two-step finite difference methods for solving the wave equation in two and three spatial dimensions are examined. These schemes depend on several free parameters, and can be associated with so-called interpolated digital waveguide meshes. Special attention is paid to the stability properties of these schemes (in particular the bounds on the space-step/time-step ratio) and their relationship with the passivity condition on the related digital waveguide networks. Boundary conditions are also discussed. An analysis of the directional numerical dispersion properties of these schemes is provided, and minimally directionally-dispersive interpolated digital waveguide meshes are constructed.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"2009 1","pages":"255-266"},"PeriodicalIF":0.0,"publicationDate":"2003-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78581867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distortion discriminant analysis for audio fingerprinting","authors":"C. Burges, John C. Platt, S. Jana","doi":"10.1109/TSA.2003.811538","DOIUrl":"https://doi.org/10.1109/TSA.2003.811538","url":null,"abstract":"Mapping audio data to feature vectors for the classification, retrieval or identification tasks presents four principal challenges. The dimensionality of the input must be significantly reduced; the resulting features must be robust to likely distortions of the input; the features must be informative for the task at hand; and the feature extraction operation must be computationally efficient. We propose distortion discriminant analysis (DDA), which fulfills all four of these requirements. DDA constructs a linear, convolutional neural network out of layers, each of which performs an oriented PCA dimensional reduction. We demonstrate the effectiveness of DDA on two audio fingerprinting tasks: searching for 500 audio clips in 36 h of audio test data; and playing over 10 days of audio against a database with approximately 240 000 fingerprints. We show that the system is robust to kinds of noise that are not present in the training procedure. In the large test, the system gives a false positive rate of 1.5 × 10^-8 per audio clip, per fingerprint, at a false negative rate of 0.2% per clip.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"33 1","pages":"165-174"},"PeriodicalIF":0.0,"publicationDate":"2003-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80139364","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SNR estimation based on amplitude modulation analysis with applications to noise suppression","authors":"J. Tchorz, B. Kollmeier","doi":"10.1109/TSA.2003.811542","DOIUrl":"https://doi.org/10.1109/TSA.2003.811542","url":null,"abstract":"A single-microphone noise suppression algorithm is described that is based on a novel approach for the estimation of the signal-to-noise ratio (SNR) in different frequency channels: The input signal is transformed into neurophysiologically-motivated spectro-temporal input features. These patterns are called amplitude modulation spectrograms (AMS), as they contain information of both center frequencies and modulation frequencies within each 32 ms-analysis frame. The different representations of speech and noise in AMS patterns are detected by a neural network, which estimates the present SNR in each frequency channel. Quantitative experiments show a reliable estimation of the SNR for most types of nonspeech background noise. For noise suppression, the frequency bands are attenuated according to the estimated present SNR using a Wiener filter approach. Objective speech quality measures, informal listening tests, and the results of automatic speech recognition experiments indicate a substantial benefit from AMS-based noise suppression, in comparison to unprocessed noisy speech.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"1 1","pages":"184-192"},"PeriodicalIF":0.0,"publicationDate":"2003-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85986442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Recursive identification of acoustic echo systems using orthonormal basis functions","authors":"Lester S. H. Ngia","doi":"10.1109/TSA.2003.811536","DOIUrl":"https://doi.org/10.1109/TSA.2003.811536","url":null,"abstract":"In hands-free telephone or video conference application, there exists an acoustic feedback coupling between the loudspeaker and microphone in an enclosed environment, which creates the acoustic echo. FIR filters are commonly used in acoustic echo cancellers because of their simple structure. However, in this paper, the Kautz and Laguerre filter structures are shown to be more efficient echo cancellers than the FIR filters, because they can describe accurately the acoustic echo system with fewer parameters. These filters are built from their respective orthonormal Kautz and Laguerre basis functions. The proposal is motivated by some theoretical and numerical results that the time-varying acoustic echo path is basically due to its time-varying zeros and not its time-invariant acoustical poles. Therefore, the poles of the Kautz and the Laguerre filters are estimated, and can be kept fixed or updated occasionally if required. The poles are estimated by a batch Gauss-Newton algorithm. Then, the coefficients of the Kautz and Laguerre filters can be estimated by most recursive algorithms that are suitable for linear regression models, e.g., the normalized LMS algorithm. Generally, it is shown that the proposed Kautz and Laguerre filters, as the filter structures in an acoustic echo canceller, have better convergence and tracking properties than the FIR and IIR filters.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"44 1","pages":"278-293"},"PeriodicalIF":0.0,"publicationDate":"2003-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74181143","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A multipitch tracking algorithm for noisy speech","authors":"Mingyang Wu, Deliang Wang, Guy J. Brown","doi":"10.1109/TSA.2003.811539","DOIUrl":"https://doi.org/10.1109/TSA.2003.811539","url":null,"abstract":"An effective multipitch tracking algorithm for noisy speech is critical for acoustic signal processing. However, the performance of existing algorithms is not satisfactory. We present a robust algorithm for multipitch tracking of noisy speech. Our approach integrates an improved channel and peak selection method, a new method for extracting periodicity information across different channels, and a hidden Markov model (HMM) for forming continuous pitch tracks. The resulting algorithm can reliably track single and double pitch tracks in a noisy environment. We suggest a pitch error measure for the multipitch situation. The proposed algorithm is evaluated on a database of speech utterances mixed with various types of interference. Quantitative comparisons show that our algorithm significantly outperforms existing ones.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"31 1","pages":"229-241"},"PeriodicalIF":0.0,"publicationDate":"2003-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81004633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An enhanced dynamic time warping model for improved estimation of DTW parameters","authors":"R. Yaniv, D. Burshtein","doi":"10.1109/TSA.2003.811540","DOIUrl":"https://doi.org/10.1109/TSA.2003.811540","url":null,"abstract":"We introduce an enhanced dynamic time warping model (EDTW) which, unlike conventional dynamic time warping (DTW), considers all possible alignment paths for recognition as well as for parameter estimation. The model, for which DTW and the hidden Markov model (HMM) are special cases, is based on a well-defined quality measure. We extend the derivation of the Forward and Viterbi algorithms for HMMs, in order to obtain efficient solutions for the problems of recognition and optimal path alignment in the new proposed model. We then extend the Baum-Welch (1972) estimation algorithm for HMMs and obtain an iterative method for estimating the model parameters of the new model based on the Baum inequality. This estimation method efficiently considers all possible alignment paths between the training data and the current model. A standard segmental K-means estimation algorithm is also derived for EDTW. We compare the performance of the two training algorithms, with various path movement constraints, in two isolated letter recognition tasks. The new estimation algorithm was found to improve performance over segmental K-means in most experiments.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"6 1","pages":"216-228"},"PeriodicalIF":0.0,"publicationDate":"2003-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73673552","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Natural language spoken interface control using data-driven semantic inference","authors":"J. Bellegarda, Kim E. A. Silverman","doi":"10.1109/TSA.2003.811534","DOIUrl":"https://doi.org/10.1109/TSA.2003.811534","url":null,"abstract":"Spoken interaction tasks are typically approached using a formal grammar as language model. While ensuring good system performance, this imposes a rigid framework on users, by implicitly forcing them to conform to a pre-defined interaction structure. This paper introduces the concept of data-driven semantic inference, which in principle allows for any word constructs in command/query formulation. Each unconstrained word string is automatically mapped onto the intended action through a semantic classification against the set of supported actions. As a result, it is no longer necessary for users to memorize the exact syntax of every command. The underlying (latent semantic analysis) framework relies on co-occurrences between words and commands, as observed in a training corpus. A suitable extension can also handle commands that are ambiguous at the word level. The behavior of semantic inference is characterized using a desktop user interface control task involving 113 different actions. Under realistic usage conditions, this approach exhibits a 2 to 5% classification error rate. Various training scenarios of increasing scope are considered to assess the influence of coverage on performance. Sufficient semantic knowledge about the task domain is found to be captured at a level of coverage as low as 70%. This illustrates the good generalization properties of semantic inference.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"28 1","pages":"267-277"},"PeriodicalIF":0.0,"publicationDate":"2003-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75942789","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}