IEEE Trans. Speech Audio Process. — Latest Articles

Multiple fundamental frequency estimation based on harmonicity and spectral smoothness
IEEE Trans. Speech Audio Process. Pub Date: 2003-11-01 DOI: 10.1109/TSA.2003.815516
Anssi Klapuri
Abstract: A new method for estimating the fundamental frequencies of concurrent musical sounds is described. The method is based on an iterative approach, in which the fundamental frequency of the most prominent sound is estimated, the sound is subtracted from the mixture, and the process is repeated for the residual signal. For the estimation stage, an algorithm is proposed that utilizes the frequency relationships of simultaneous spectral components without assuming ideal harmonicity. For the subtraction stage, the spectral smoothness principle is proposed as an efficient new mechanism for estimating the spectral envelopes of detected sounds. With these techniques, multiple fundamental frequency estimation can be performed quite accurately in a single time frame, without the use of long-term temporal features. The experimental data comprised recorded samples of 30 musical instruments from four different sources. Multiple fundamental frequency estimation was performed for random sound source and pitch combinations. Error rates for mixtures of one to six simultaneous sounds were 1.8%, 3.9%, 6.3%, 9.9%, 14%, and 18%, respectively. In musical interval and chord identification tasks, the algorithm outperformed the average of ten trained musicians. The method works robustly in noise and is able to handle sounds that exhibit inharmonicities. The inharmonicity factor and spectral envelope of each sound are estimated along with the fundamental frequency.
Pages: 804-816
Citations: 352
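The iterative estimate-and-cancel loop the abstract describes lends itself to a compact sketch. The version below is a minimal illustration, not Klapuri's algorithm: it uses a plain harmonic-summation salience for the predominant-F0 stage and a three-point moving average over partial amplitudes as a stand-in for the spectral smoothness principle; all function names and constants are our own choices.

```python
import numpy as np

def predominant_f0(mag, fs, n_fft, f0_min=60.0, f0_max=2100.0, n_harm=10):
    """Pick the F0 whose harmonic bins sum to the largest magnitude."""
    best_f0, best_salience = 0.0, -1.0
    for f0 in np.arange(f0_min, f0_max, 1.0):
        bins = (np.arange(1, n_harm + 1) * f0 / fs * n_fft).astype(int)
        bins = bins[bins < len(mag)]
        salience = mag[bins].sum()
        if salience > best_salience:
            best_f0, best_salience = f0, salience
    return best_f0

def subtract_smooth(mag, fs, n_fft, f0, n_harm=10):
    """Cancel a detected sound using a smoothed partial envelope, so
    strong partials of the remaining sounds are not over-subtracted."""
    bins = (np.arange(1, n_harm + 1) * f0 / fs * n_fft).astype(int)
    bins = bins[bins < len(mag)]
    amps = mag[bins].copy()
    # Spectral smoothness stand-in: average each partial amplitude with
    # its neighbors, and never subtract more than that smooth envelope.
    smooth = np.convolve(amps, np.ones(3) / 3.0, mode='same')
    out = mag.copy()
    out[bins] = np.maximum(mag[bins] - np.minimum(amps, smooth), 0.0)
    return out

def multi_f0(frame, fs, n_sounds=3):
    """Iterate: estimate the most prominent F0, cancel it, repeat."""
    n_fft = len(frame)
    mag = np.abs(np.fft.rfft(frame * np.hanning(n_fft)))
    f0s = []
    for _ in range(n_sounds):
        f0 = predominant_f0(mag, fs, n_fft)
        f0s.append(f0)
        mag = subtract_smooth(mag, fs, n_fft, f0)
    return f0s

f0s = multi_f0(np.random.randn(4096), fs=44100)  # toy single-frame input
```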
Speech enhancement using 2-D Fourier transform
IEEE Trans. Speech Audio Process. Pub Date: 2003-11-01 DOI: 10.1109/TSA.2003.816063
I. Soon, S. Koh
Abstract: This paper presents an innovative way of using the two-dimensional (2-D) Fourier transform for speech enhancement. The blocking and windowing of the speech data for the 2-D Fourier transform are explained in detail. Several techniques for filtering in the 2-D Fourier transform domain are also proposed, including magnitude spectral subtraction, 2-D Wiener filtering, and a hybrid filter that effectively combines the one-dimensional (1-D) Wiener filter with the 2-D Wiener filter. The proposed hybrid filter compares favorably against the other techniques in an objective test.
Pages: 717-724
Citations: 58
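A minimal sketch of the core idea, with details assumed rather than taken from the paper: stack adjacent frames into a 2-D block, apply magnitude spectral subtraction in the 2-D Fourier domain, and resynthesize with the noisy phase. The block size, subtraction factor, and spectral floor are illustrative choices.

```python
import numpy as np

def enhance_block(block, noise_mag, alpha=1.0, floor=0.05):
    """2-D magnitude spectral subtraction on a block of stacked frames.

    block:     2-D array, rows = consecutive windowed speech frames
    noise_mag: 2-D noise magnitude estimate of the same shape
    """
    spec = np.fft.fft2(block)
    mag, phase = np.abs(spec), np.angle(spec)
    # Subtract the noise magnitude, keeping a small spectral floor to
    # limit musical noise, then resynthesize with the noisy phase.
    clean_mag = np.maximum(mag - alpha * noise_mag, floor * mag)
    return np.real(np.fft.ifft2(clean_mag * np.exp(1j * phase)))

# Illustrative use: the noise magnitude comes from a speech-free block.
noisy = np.random.randn(16, 256)                  # 16 stacked frames
noise_mag = np.abs(np.fft.fft2(np.random.randn(16, 256)))
enhanced = enhance_block(noisy, noise_mag)
```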
CSA-BF: a constrained switched adaptive beamformer for speech enhancement and recognition in real car environments
IEEE Trans. Speech Audio Process. Pub Date: 2003-11-01 DOI: 10.1109/TSA.2003.818034
Xianxian Zhang, J. Hansen
Abstract: While a number of studies have investigated various speech enhancement and processing schemes for in-vehicle speech systems, little research has been performed using actual voice data collected in noisy car environments. In this paper, we propose a new constrained switched adaptive beamforming algorithm (CSA-BF) for speech enhancement and recognition in real moving-car environments. The proposed algorithm consists of a speech/noise constraint section, a speech adaptive beamformer, and a noise adaptive beamformer. We investigate CSA-BF performance in comparison with classic delay-and-sum beamforming (DASB) under realistic car conditions, using a corpus of data recorded in various car noise environments from across the U.S. After analyzing the experimental results and considering the range of complex noise situations in the car environment using the CU-Move corpus, we formulate the three specific processing stages of the CSA-BF algorithm. The method is evaluated and shown to simultaneously decrease the word error rate (WER) for speech recognition by up to 31% and improve speech quality, as measured by segmental SNR, by up to +5.5 dB on average.
Pages: 733-745
Citations: 33
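The switching idea — adapt toward noise only when the constraint section declares a noise-only frame — can be illustrated with a simple two-channel NLMS canceller. This is a hedged sketch of switched adaptation in general, not the paper's three-stage CSA-BF; the frame layout, step size, and filter length are assumptions.

```python
import numpy as np

def switched_canceller(frames, vad_flags, mu=0.5, taps=16):
    """Switched adaptive noise cancellation: the filter mapping the
    noise-reference channel onto the primary channel adapts only in
    noise-only frames, so it never learns to cancel the speech.

    frames:    (n_frames, frame_len, 2) array; channel 0 is the primary
               (e.g., a delay-and-sum output), channel 1 a noise reference
    vad_flags: per-frame 1 = speech present, 0 = noise only
    """
    w = np.zeros(taps)
    output = []
    for frame, is_speech in zip(frames, vad_flags):
        primary, ref = frame[:, 0], frame[:, 1]
        e = primary.copy()
        for n in range(taps, len(primary)):
            x = ref[n - taps:n][::-1]        # most recent sample first
            e[n] = primary[n] - w @ x        # enhanced output sample
            if not is_speech:                # NLMS update, noise frames only
                w += mu * e[n] * x / (x @ x + 1e-8)
        output.append(e)
    return np.concatenate(output)

flags = np.r_[np.zeros(10), np.ones(10)]     # 10 noise frames, 10 speech
out = switched_canceller(np.random.randn(20, 160, 2), flags)
```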
Approximately independent factors of speech using nonlinear symplectic transformation
IEEE Trans. Speech Audio Process. Pub Date: 2003-11-01 DOI: 10.1109/TSA.2003.814457
M. Omar, M. Hasegawa-Johnson
Abstract: This paper addresses the problem of representing the speech signal using a set of features that are approximately statistically independent. This statistical independence simplifies building probabilistic models based on these features for use in applications such as speech recognition. Since there is no evidence that the speech signal is a linear combination of separate factors or sources, we use a more general nonlinear transformation of the speech signal to achieve our approximately statistically independent feature set. We choose the transformation to be symplectic to maximize the likelihood of the generated feature set. In this paper, we describe applying this nonlinear transformation both to the speech time-domain data directly and to the Mel-frequency cepstrum coefficients (MFCC). We also discuss experiments in which the generated feature set is transformed into a more compact set using a maximum mutual information linear transformation. This linear transformation is used to generate the acoustic features that represent the distinctions among the phonemes. The features resulting from this transformation are used in phoneme recognition experiments. The best results achieved show about a 2% improvement in recognition accuracy compared to results based on MFCC features.
Pages: 660-671
Citations: 10
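The reason a symplectic transformation suits maximum-likelihood training is that its Jacobian determinant is one, so the likelihood of the transformed features needs no volume-correction term. The shear layer below illustrates that property in a few lines; the potential function and the position/momentum split are our illustrative choices, not the paper's parameterization.

```python
import numpy as np

def shear_layer(q, p, w, b):
    """One symplectic shear: q' = q, p' = p + dV/dq, with the scalar
    potential V(q) = sum(log cosh(w*q + b)). The Jacobian is unit
    lower-triangular, so det = 1 and the log-likelihood of the output
    needs no volume-correction term."""
    return q, p + np.tanh(w * q + b) * w

# Split a feature vector into "position" and "momentum" halves, then
# transform; sizes here are arbitrary (e.g., an MFCC-sized frame).
x = np.random.randn(26)
q, p = x[:13], x[13:]
w, b = np.random.randn(13), np.zeros(13)
q2, p2 = shear_layer(q, p, w, b)
```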
Nonuniform oversampled filter banks for audio signal processing
IEEE Trans. Speech Audio Process. Pub Date: 2003-08-26 DOI: 10.1109/TSA.2003.814412
Z. Cvetković, J. Johnston
Abstract: In emerging audio technology applications, there is a need for decompositions of audio signals into oversampled subband components with a time-frequency resolution that mimics that of the cochlear filter bank, and with high aliasing attenuation in each of the subbands independently, rather than relying on aliasing cancellation properties. We present a design of nearly perfect reconstruction nonuniform oversampled filter banks that implement signal decompositions of this kind.
Pages: 393-399
Citations: 60
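One way to picture such a decomposition is a bank of bandpass filters whose center frequencies and bandwidths follow an auditory (ERB) scale, with every subband kept at the full sampling rate — hence oversampled — so each band's aliasing is attenuated by its own filter rather than cancelled across bands. The sketch below uses Butterworth filters and the Glasberg-Moore ERB formula as stand-ins; the paper's actual filter design differs.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def erb_space(f_lo, f_hi, n_bands):
    """Center frequencies equally spaced on the ERB-rate scale,
    erb_rate(f) = 21.4 * log10(1 + 0.00437 * f)."""
    e = np.linspace(21.4 * np.log10(1 + 0.00437 * f_lo),
                    21.4 * np.log10(1 + 0.00437 * f_hi), n_bands)
    return (10 ** (e / 21.4) - 1) / 0.00437

def auditory_filter_bank(x, fs, n_bands=24):
    """Decompose x into full-rate (hence oversampled) subbands whose
    bandwidths grow with center frequency, cochlea-style."""
    fcs = erb_space(80.0, 0.45 * fs, n_bands)
    subbands = []
    for fc in fcs:
        bw = 24.7 * (0.00437 * fc + 1)        # one ERB at this frequency
        lo, hi = max(fc - bw, 1.0), min(fc + bw, 0.49 * fs)
        sos = butter(4, [lo, hi], btype='bandpass', fs=fs, output='sos')
        subbands.append(sosfilt(sos, x))
    return np.stack(subbands)

bands = auditory_filter_bank(np.random.randn(8000), fs=16000)
```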
Matching pursuits sinusoidal speech coding
IEEE Trans. Speech Audio Process. Pub Date: 2003-08-26 DOI: 10.1109/TSA.2003.815520
Ç. Etemoglu, V. Cuperman
Abstract: This paper introduces a sinusoidal modeling technique for low bit rate speech coding in which the parameters of each sinusoidal component are sequentially extracted by closed-loop analysis. The sinusoidal modeling of the speech linear prediction (LP) residual is performed within the general framework of matching pursuits with a dictionary of sinusoids. The frequency space of the sinusoids is restricted to sets of frequency intervals, or bins, which in conjunction with the closed-loop analysis allow the frequencies of the sinusoids to be mapped into a frequency vector that is efficiently quantized. In voiced frames, two sets of frequency vectors are generated: one representing harmonically related components of the voiced segment and the other nonharmonically related components. This approach eliminates the need for a voicing-dependent cutoff frequency, which is difficult to estimate correctly and to quantize at low bit rates. In transition frames, to efficiently extract and quantize the set of frequencies needed for the sinusoidal representation of the LP residual, we introduce frequency bin vector quantization (FBVQ). FBVQ selects a vector of nonuniformly spaced frequencies from a frequency codebook in order to represent the frequency-domain information in transition regions. Our use of FBVQ with closed-loop searching contributes to an improvement of speech quality in transition frames. The effectiveness of the coding scheme is enhanced by exploiting the critical-band concept of auditory perception in defining the frequency bins. To demonstrate the viability and advantages of the new models studied, we designed a 4 kbps matching pursuits sinusoidal speech coder. Subjective results indicate that the proposed coder at 4 kbps has quality exceeding that of the 6.3 kbps G.723.1 coder.
Pages: 413-424
Citations: 14
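The matching-pursuit loop itself is easy to sketch: at each pass, pick the frequency (restricted to an allowed grid) whose cosine/sine pair best fits the residual in the least-squares sense, record its parameters, and subtract. The grid and atom count below are illustrative, and the sketch omits the LP residual modeling and quantization stages.

```python
import numpy as np

def mp_sinusoids(signal, freq_grid, n_atoms=8):
    """Greedy matching pursuit over a sinusoid dictionary: each pass fits
    amplitude and phase (via a cos/sin pair) at every candidate frequency,
    keeps the one capturing the most residual energy, and subtracts it.

    freq_grid: candidate frequencies, normalized to the sample rate
    """
    r = signal.astype(float).copy()
    n = np.arange(len(r))
    params = []
    for _ in range(n_atoms):
        best = None
        for f in freq_grid:
            c, s = np.cos(2 * np.pi * f * n), np.sin(2 * np.pi * f * n)
            # Exact least-squares projection onto span{cos, sin}.
            gram = np.array([[c @ c, c @ s], [c @ s, s @ s]])
            a, b = np.linalg.solve(gram, np.array([r @ c, r @ s]))
            atom = a * c + b * s
            gain = r @ atom                  # energy captured by the atom
            if best is None or gain > best[0]:
                best = (gain, f, a, b, atom)
        _, f, a, b, atom = best
        params.append((f, a, b))
        r -= atom
    return params, r

grid = np.linspace(0.005, 0.45, 200)         # coarse stand-in for the bins
params, residual = mp_sinusoids(np.random.randn(160), grid)
```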
Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging
IEEE Trans. Speech Audio Process. Pub Date: 2003-08-26 DOI: 10.1109/TSA.2003.811544
I. Cohen
Abstract: Noise spectrum estimation is a fundamental component of speech enhancement and speech recognition systems. We present an improved minima controlled recursive averaging (IMCRA) approach for noise estimation in adverse environments involving nonstationary noise, weak speech components, and low input signal-to-noise ratio (SNR). The noise estimate is obtained by averaging past spectral power values, using a time-varying frequency-dependent smoothing parameter that is adjusted by the signal presence probability. The speech presence probability is controlled by the minima values of a smoothed periodogram. The proposed procedure comprises two iterations of smoothing and minimum tracking. The first iteration provides a rough voice activity detection in each frequency band. Smoothing in the second iteration then excludes relatively strong speech components, which makes the minimum tracking robust during speech activity. We show that in nonstationary noise environments and under low SNR conditions, the IMCRA approach is very effective. In particular, compared to a competitive method, it obtains a lower estimation error and, when integrated into a speech enhancement system, achieves improved speech quality and lower residual noise.
Pages: 466-475
Citations: 949
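A simplified single-iteration sketch of minima-controlled recursive averaging conveys the mechanism: track the minimum of a smoothed periodogram, declare speech where the smoothed power sits well above that minimum, and freeze the recursive noise update there. The hard 0/1 presence decision and all constants below are simplifications of IMCRA's soft, two-iteration procedure.

```python
import numpy as np

def mcra_noise_track(power_frames, alpha=0.92, beta=0.7, win=50, delta=4.0):
    """Minima-controlled recursive averaging, reduced to one iteration.

    power_frames: (n_frames, n_bins) noisy-speech periodograms
    Returns a per-frame noise PSD estimate.
    """
    s = power_frames[0].copy()               # smoothed periodogram
    noise = power_frames[0].copy()
    history = []
    out = np.empty_like(power_frames)
    for t in range(len(power_frames)):
        s = beta * s + (1 - beta) * power_frames[t]
        history.append(s.copy())
        if len(history) > win:
            history.pop(0)
        s_min = np.min(history, axis=0)      # running minimum per bin
        # Speech presence where smoothed power sits well above the
        # tracked minimum; a hard 0/1 decision for brevity.
        p_speech = (s > delta * s_min).astype(float)
        # Time-varying smoothing: freeze the noise update where speech
        # is likely present (alpha_t -> 1).
        alpha_t = alpha + (1 - alpha) * p_speech
        noise = alpha_t * noise + (1 - alpha_t) * power_frames[t]
        out[t] = noise
    return out

psd = np.abs(np.fft.rfft(np.random.randn(200, 256), axis=1)) ** 2
noise_est = mcra_noise_track(psd)
```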
Quantization of LSF parameters using a trellis modeling
IEEE Trans. Speech Audio Process. Pub Date: 2003-08-26 DOI: 10.1109/TSA.2003.814411
F. Lahouti, A. Khandani
Abstract: An efficient block-based trellis quantization (BTQ) scheme is proposed for the quantization of the line spectral frequencies (LSF) in speech coding applications. The scheme is based on modeling the LSF intraframe dependencies with a trellis structure. The ordering property and the fact that LSF parameters are bounded within a range are explicitly incorporated in the trellis model. BTQ search and design algorithms are discussed, and an efficient algorithm for index generation (finding the index of a path in the trellis) is presented. The sequential vector decorrelation technique is also presented to effectively exploit the intraframe correlation of LSF parameters within the trellis. Based on the proposed block-based trellis quantizer, two intraframe schemes and one interframe scheme are proposed. Comparisons are provided to the split-VQ, the trellis-coded quantization of LSF parameters, and the multi-stage VQ, as well as to the interframe schemes used in the IS-641 EFRC and the GSM AMR codec. These results demonstrate that the proposed BTQ schemes outperform the above systems.
Pages: 400-412
Citations: 12
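The ordering property maps naturally onto trellis transitions: a codeword for LSF i may only follow a strictly smaller codeword for LSF i-1, and a Viterbi-style dynamic program finds the cheapest ordered path. The sketch below uses independent scalar codebooks per coefficient for clarity; the paper's trellis and codebook design are more elaborate.

```python
import numpy as np

def trellis_lsf_quantize(lsf, codebooks):
    """Viterbi-style search over per-coefficient codebooks, with the LSF
    ordering constraint (each quantized value must exceed the previous
    one) built into the allowed trellis transitions.

    lsf:       ordered vector of line spectral frequencies
    codebooks: list of 1-D sorted arrays, one per LSF coefficient
    """
    n = len(lsf)
    cost = [(lsf[0] - codebooks[0]) ** 2]
    back = []
    for i in range(1, n):
        prev_cb, cb = codebooks[i - 1], codebooks[i]
        # allowed[j, k] is True when codeword k may follow codeword j.
        allowed = cb[None, :] > prev_cb[:, None]
        total = cost[-1][:, None] + (lsf[i] - cb[None, :]) ** 2
        total = np.where(allowed, total, np.inf)
        back.append(np.argmin(total, axis=0))
        cost.append(np.min(total, axis=0))
    # Trace back the cheapest ordered path through the trellis.
    idx = [int(np.argmin(cost[-1]))]
    for i in range(n - 2, -1, -1):
        idx.append(int(back[i][idx[-1]]))
    idx.reverse()
    return np.array([codebooks[i][k] for i, k in enumerate(idx)])

cbs = [np.sort(np.random.rand(8)) for _ in range(10)]
quantized = trellis_lsf_quantize(np.sort(np.random.rand(10)), cbs)
```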
Efficient text-independent speaker verification with structural Gaussian mixture models and neural network
IEEE Trans. Speech Audio Process. Pub Date: 2003-08-26 DOI: 10.1109/TSA.2003.815822
Bing Xiang, T. Berger
Abstract: We present an integrated system with structural Gaussian mixture models (SGMMs) and a neural network that achieves both computational efficiency and high accuracy in text-independent speaker verification. A structural background model (SBM) is first constructed by hierarchically clustering all Gaussian mixture components in a universal background model (UBM). In this way, the acoustic space is partitioned into multiple regions at different levels of resolution. For each target speaker, an SGMM can be generated through multilevel maximum a posteriori (MAP) adaptation from the SBM. During testing, only a small subset of Gaussian mixture components is scored for each feature vector, which significantly reduces the computational cost. Furthermore, the scores obtained in different layers of the tree-structured models are combined via a neural network for the final decision. Different configurations are compared in experiments conducted on the telephony speech data used in the NIST speaker verification evaluation. The experimental results show that a computational reduction by a factor of 17 can be achieved with a 5% relative reduction in equal error rate (EER) compared with the baseline. The SGMM-SBM also shows some advantages over the recently proposed hash GMM, including higher speed and better verification performance.
Pages: 447-456
Citations: 120
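The computational saving comes from scoring only the components under the best-matching clusters. A two-level version of that idea is sketched below, with fixed-size groups and simple centroid ranking standing in for the hierarchical SBM; the multilevel MAP adaptation and neural-network score combination are omitted.

```python
import numpy as np

def log_gauss(x, means, variances):
    """Diagonal-covariance Gaussian log-densities of x under each row."""
    return -0.5 * np.sum(np.log(2 * np.pi * variances)
                         + (x - means) ** 2 / variances, axis=1)

def fast_ubm_score(x, means, variances, weights,
                   group_of, group_means, top_groups=2):
    """Two-level structural scoring: rank coarse clusters of UBM
    components first, then evaluate only components in the best clusters."""
    centroid_scores = -np.sum((x - group_means) ** 2, axis=1)
    best = np.argsort(centroid_scores)[-top_groups:]
    mask = np.isin(group_of, best)
    ll = log_gauss(x, means[mask], variances[mask]) + np.log(weights[mask])
    return np.logaddexp.reduce(ll)   # log p(x) over the selected subset

# Toy setup: 64 UBM components in 8 clusters of 8 (sizes illustrative).
rng = np.random.default_rng(0)
means = rng.normal(size=(64, 12))
variances = np.ones((64, 12))
weights = np.full(64, 1 / 64)
group_of = np.repeat(np.arange(8), 8)
group_means = np.array([means[group_of == g].mean(axis=0) for g in range(8)])
score = fast_ubm_score(rng.normal(size=12), means, variances, weights,
                       group_of, group_means)
```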
A soft voice activity detector based on a Laplacian-Gaussian model
IEEE Trans. Speech Audio Process. Pub Date: 2003-08-26 DOI: 10.1109/TSA.2003.815518
S. Gazor, Wei Zhang
Abstract: A new voice activity detector (VAD) is developed in this paper. The VAD is derived by applying a Bayesian hypothesis test on decorrelated speech samples. The signal is first decorrelated using an orthogonal transformation, e.g., the discrete cosine transform (DCT) or the adaptive Karhunen-Loeve transform (KLT). As recent investigations suggest, the distributions of clean speech and noise signals are assumed to be Laplacian and Gaussian, respectively. In addition, a hidden Markov model (HMM) with two states representing silence and speech is employed. The proposed soft VAD recursively estimates the probability of voice being active (VBA). To this end, the a priori probability of VBA is first predicted from feedback information at the previous time instant; the predicted probability is then updated with the new observation to give the probability of VBA at the current time instant. The required parameters of both the speech and noise signals are estimated adaptively by the maximum likelihood (ML) approach. Simulation results show that the proposed soft VAD, which uses a Laplacian distribution model for speech signals, outperforms the previous VAD based on a Gaussian model.
Pages: 498-505
Citations: 157
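The recursive estimate has a clean prediction-update form: propagate the previous posterior through the HMM transition probabilities, then apply Bayes' rule with the Laplacian (speech) and Gaussian (noise) frame likelihoods. The sketch below assumes known, fixed model parameters rather than the paper's adaptive ML estimates.

```python
import numpy as np

def soft_vad(frames, sigma_n, b_s, a01=0.1, a10=0.1):
    """Soft VAD on decorrelated coefficients: Laplacian speech model,
    Gaussian noise model, two-state HMM smoothing of the posterior.

    frames:  (n_frames, n_coeffs) decorrelated (e.g., DCT) coefficients
    sigma_n: noise standard deviation per coefficient
    b_s:     Laplacian scale of the speech model per coefficient
    a01/a10: silence->speech and speech->silence transition probabilities
    """
    p = 0.5                                  # P(speech) at previous frame
    probs = []
    for x in frames:
        # Frame log-likelihoods under the two hypotheses.
        ll_noise = np.sum(-0.5 * np.log(2 * np.pi * sigma_n ** 2)
                          - x ** 2 / (2 * sigma_n ** 2))
        ll_speech = np.sum(-np.log(2 * b_s) - np.abs(x) / b_s)
        # Prediction step through the HMM transitions.
        prior = p * (1 - a10) + (1 - p) * a01
        # Update step (Bayes' rule), numerically clipped for safety.
        r = np.exp(np.clip(ll_speech - ll_noise, -50, 50))
        p = prior * r / (prior * r + (1 - prior))
        probs.append(p)
    return np.array(probs)

vba = soft_vad(np.random.randn(100, 32), sigma_n=1.0, b_s=1.2)
```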