IEEE Trans. Speech Audio Process. — Latest Articles

Multiple fundamental frequency estimation based on harmonicity and spectral smoothness
IEEE Trans. Speech Audio Process. Pub Date: 2003-11-01 DOI: 10.1109/TSA.2003.815516
Anssi Klapuri
Abstract: A new method for estimating the fundamental frequencies of concurrent musical sounds is described. The method is based on an iterative approach, in which the fundamental frequency of the most prominent sound is estimated, the sound is subtracted from the mixture, and the process is repeated for the residual signal. For the estimation stage, an algorithm is proposed that utilizes the frequency relationships of simultaneous spectral components without assuming ideal harmonicity. For the subtraction stage, the spectral smoothness principle is proposed as an efficient new mechanism for estimating the spectral envelopes of detected sounds. With these techniques, multiple fundamental frequency estimation can be performed quite accurately in a single time frame, without the use of long-term temporal features. The experimental data comprised recorded samples of 30 musical instruments from four different sources. Multiple fundamental frequency estimation was performed for random sound source and pitch combinations. Error rates for mixtures of one to six simultaneous sounds were 1.8%, 3.9%, 6.3%, 9.9%, 14%, and 18%, respectively. In musical interval and chord identification tasks, the algorithm outperformed the average of ten trained musicians. The method works robustly in noise and is able to handle sounds that exhibit inharmonicities. The inharmonicity factor and spectral envelope of each sound are estimated along with the fundamental frequency.
Pages: 804-816
Citations: 352
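The iterative estimate-and-cancel loop the abstract describes lends itself to a compact sketch. The version below is a minimal illustration, not Klapuri's algorithm: it uses a plain harmonic-summation salience for the predominant-F0 stage and a three-point moving average over partial amplitudes as a stand-in for the spectral smoothness principle; all function names and constants are our own choices.

```python
import numpy as np

def predominant_f0(mag, fs, n_fft, f0_min=60.0, f0_max=2100.0, n_harm=10):
    """Pick the F0 whose harmonic bins sum to the largest magnitude."""
    best_f0, best_salience = 0.0, -1.0
    for f0 in np.arange(f0_min, f0_max, 1.0):
        bins = (np.arange(1, n_harm + 1) * f0 / fs * n_fft).astype(int)
        bins = bins[bins < len(mag)]
        salience = mag[bins].sum()
        if salience > best_salience:
            best_f0, best_salience = f0, salience
    return best_f0

def subtract_smooth(mag, fs, n_fft, f0, n_harm=10):
    """Cancel a detected sound using a smoothed partial envelope, so
    strong partials of the remaining sounds are not over-subtracted."""
    bins = (np.arange(1, n_harm + 1) * f0 / fs * n_fft).astype(int)
    bins = bins[bins < len(mag)]
    amps = mag[bins].copy()
    # Spectral smoothness stand-in: average each partial amplitude with
    # its neighbors, and never subtract more than that smooth envelope.
    smooth = np.convolve(amps, np.ones(3) / 3.0, mode='same')
    out = mag.copy()
    out[bins] = np.maximum(mag[bins] - np.minimum(amps, smooth), 0.0)
    return out

def multi_f0(frame, fs, n_sounds=3):
    """Iterate: estimate the most prominent F0, cancel it, repeat."""
    n_fft = len(frame)
    mag = np.abs(np.fft.rfft(frame * np.hanning(n_fft)))
    f0s = []
    for _ in range(n_sounds):
        f0 = predominant_f0(mag, fs, n_fft)
        f0s.append(f0)
        mag = subtract_smooth(mag, fs, n_fft, f0)
    return f0s

f0s = multi_f0(np.random.randn(4096), fs=44100)  # toy single-frame input
```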
Speech enhancement using 2-D Fourier transform
IEEE Trans. Speech Audio Process. Pub Date: 2003-11-01 DOI: 10.1109/TSA.2003.816063
I. Soon, S. Koh
Abstract: This paper presents an innovative way of using the two-dimensional (2-D) Fourier transform for speech enhancement. The blocking and windowing of the speech data for the 2-D Fourier transform are explained in detail. Several techniques for filtering in the 2-D Fourier transform domain are also proposed, including magnitude spectral subtraction, 2-D Wiener filtering, and a hybrid filter that effectively combines the one-dimensional (1-D) Wiener filter with the 2-D Wiener filter. The proposed hybrid filter compares favorably against the other techniques in an objective test.
Pages: 717-724
Citations: 58
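A minimal sketch of the core idea, with details assumed rather than taken from the paper: stack adjacent frames into a 2-D block, apply magnitude spectral subtraction in the 2-D Fourier domain, and resynthesize with the noisy phase. The block size, subtraction factor, and spectral floor are illustrative choices.

```python
import numpy as np

def enhance_block(block, noise_mag, alpha=1.0, floor=0.05):
    """2-D magnitude spectral subtraction on a block of stacked frames.

    block:     2-D array, rows = consecutive windowed speech frames
    noise_mag: 2-D noise magnitude estimate of the same shape
    """
    spec = np.fft.fft2(block)
    mag, phase = np.abs(spec), np.angle(spec)
    # Subtract the noise magnitude, keeping a small spectral floor to
    # limit musical noise, then resynthesize with the noisy phase.
    clean_mag = np.maximum(mag - alpha * noise_mag, floor * mag)
    return np.real(np.fft.ifft2(clean_mag * np.exp(1j * phase)))

# Illustrative use: the noise magnitude comes from a speech-free block.
noisy = np.random.randn(16, 256)                  # 16 stacked frames
noise_mag = np.abs(np.fft.fft2(np.random.randn(16, 256)))
enhanced = enhance_block(noisy, noise_mag)
```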
CSA-BF: a constrained switched adaptive beamformer for speech enhancement and recognition in real car environments
IEEE Trans. Speech Audio Process. Pub Date: 2003-11-01 DOI: 10.1109/TSA.2003.818034
Xianxian Zhang, J. Hansen
Abstract: While a number of studies have investigated various speech enhancement and processing schemes for in-vehicle speech systems, little research has been performed using actual voice data collected in noisy car environments. In this paper, we propose a new constrained switched adaptive beamforming algorithm (CSA-BF) for speech enhancement and recognition in real moving-car environments. The proposed algorithm consists of a speech/noise constraint section, a speech adaptive beamformer, and a noise adaptive beamformer. We investigate CSA-BF performance in comparison with classic delay-and-sum beamforming (DASB) under realistic car conditions, using a corpus of data recorded in various car noise environments from across the U.S. After analyzing the experimental results and considering the range of complex noise situations in the car environment using the CU-Move corpus, we formulate the three specific processing stages of the CSA-BF algorithm. The method is evaluated and shown to simultaneously decrease the word error rate (WER) for speech recognition by up to 31% and improve speech quality, as measured by segmental SNR, by up to +5.5 dB on average.
Pages: 733-745
Citations: 33
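The switching idea — adapt toward noise only when the constraint section declares a noise-only frame — can be illustrated with a simple two-channel NLMS canceller. This is a hedged sketch of switched adaptation in general, not the paper's three-stage CSA-BF; the frame layout, step size, and filter length are assumptions.

```python
import numpy as np

def switched_canceller(frames, vad_flags, mu=0.5, taps=16):
    """Switched adaptive noise cancellation: the filter mapping the
    noise-reference channel onto the primary channel adapts only in
    noise-only frames, so it never learns to cancel the speech.

    frames:    (n_frames, frame_len, 2) array; channel 0 is the primary
               (e.g., a delay-and-sum output), channel 1 a noise reference
    vad_flags: per-frame 1 = speech present, 0 = noise only
    """
    w = np.zeros(taps)
    output = []
    for frame, is_speech in zip(frames, vad_flags):
        primary, ref = frame[:, 0], frame[:, 1]
        e = primary.copy()
        for n in range(taps, len(primary)):
            x = ref[n - taps:n][::-1]        # most recent sample first
            e[n] = primary[n] - w @ x        # enhanced output sample
            if not is_speech:                # NLMS update, noise frames only
                w += mu * e[n] * x / (x @ x + 1e-8)
        output.append(e)
    return np.concatenate(output)

flags = np.r_[np.zeros(10), np.ones(10)]     # 10 noise frames, 10 speech
out = switched_canceller(np.random.randn(20, 160, 2), flags)
```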
Approximately independent factors of speech using nonlinear symplectic transformation
IEEE Trans. Speech Audio Process. Pub Date: 2003-11-01 DOI: 10.1109/TSA.2003.814457
M. Omar, M. Hasegawa-Johnson
Abstract: This paper addresses the problem of representing the speech signal using a set of features that are approximately statistically independent. This statistical independence simplifies building probabilistic models based on these features for use in applications such as speech recognition. Since there is no evidence that the speech signal is a linear combination of separate factors or sources, we use a more general nonlinear transformation of the speech signal to achieve our approximately statistically independent feature set. We choose the transformation to be symplectic to maximize the likelihood of the generated feature set. In this paper, we describe applying this nonlinear transformation both to the speech time-domain data directly and to the Mel-frequency cepstrum coefficients (MFCC). We also discuss experiments in which the generated feature set is transformed into a more compact set using a maximum mutual information linear transformation. This linear transformation is used to generate the acoustic features that represent the distinctions among the phonemes. The features resulting from this transformation are used in phoneme recognition experiments. The best results achieved show about a 2% improvement in recognition accuracy compared to results based on MFCC features.
Pages: 660-671
Citations: 10
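The reason a symplectic transformation suits maximum-likelihood training is that its Jacobian determinant is one, so the likelihood of the transformed features needs no volume-correction term. The shear layer below illustrates that property in a few lines; the potential function and the position/momentum split are our illustrative choices, not the paper's parameterization.

```python
import numpy as np

def shear_layer(q, p, w, b):
    """One symplectic shear: q' = q, p' = p + dV/dq, with the scalar
    potential V(q) = sum(log cosh(w*q + b)). The Jacobian is unit
    lower-triangular, so det = 1 and the log-likelihood of the output
    needs no volume-correction term."""
    return q, p + np.tanh(w * q + b) * w

# Split a feature vector into "position" and "momentum" halves, then
# transform; sizes here are arbitrary (e.g., an MFCC-sized frame).
x = np.random.randn(26)
q, p = x[:13], x[13:]
w, b = np.random.randn(13), np.zeros(13)
q2, p2 = shear_layer(q, p, w, b)
```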
Nonuniform oversampled filter banks for audio signal processing
IEEE Trans. Speech Audio Process. Pub Date: 2003-08-26 DOI: 10.1109/TSA.2003.814412
Z. Cvetković, J. Johnston
Abstract: In emerging audio technology applications, there is a need for decompositions of audio signals into oversampled subband components with a time-frequency resolution that mimics that of the cochlear filter bank, and with high aliasing attenuation in each of the subbands independently, rather than relying on aliasing cancellation properties. We present a design of nearly perfect reconstruction nonuniform oversampled filter banks that implement signal decompositions of this kind.
Pages: 393-399
Citations: 60
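One way to picture such a decomposition is a bank of bandpass filters whose center frequencies and bandwidths follow an auditory (ERB) scale, with every subband kept at the full sampling rate — hence oversampled — so each band's aliasing is attenuated by its own filter rather than cancelled across bands. The sketch below uses Butterworth filters and the Glasberg-Moore ERB formula as stand-ins; the paper's actual filter design differs.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def erb_space(f_lo, f_hi, n_bands):
    """Center frequencies equally spaced on the ERB-rate scale,
    erb_rate(f) = 21.4 * log10(1 + 0.00437 * f)."""
    e = np.linspace(21.4 * np.log10(1 + 0.00437 * f_lo),
                    21.4 * np.log10(1 + 0.00437 * f_hi), n_bands)
    return (10 ** (e / 21.4) - 1) / 0.00437

def auditory_filter_bank(x, fs, n_bands=24):
    """Decompose x into full-rate (hence oversampled) subbands whose
    bandwidths grow with center frequency, cochlea-style."""
    fcs = erb_space(80.0, 0.45 * fs, n_bands)
    subbands = []
    for fc in fcs:
        bw = 24.7 * (0.00437 * fc + 1)        # one ERB at this frequency
        lo, hi = max(fc - bw, 1.0), min(fc + bw, 0.49 * fs)
        sos = butter(4, [lo, hi], btype='bandpass', fs=fs, output='sos')
        subbands.append(sosfilt(sos, x))
    return np.stack(subbands)

bands = auditory_filter_bank(np.random.randn(8000), fs=16000)
```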
Matching pursuits sinusoidal speech coding
IEEE Trans. Speech Audio Process. Pub Date: 2003-08-26 DOI: 10.1109/TSA.2003.815520
Ç. Etemoglu, V. Cuperman
Abstract: This paper introduces a sinusoidal modeling technique for low bit rate speech coding in which the parameters of each sinusoidal component are sequentially extracted by closed-loop analysis. The sinusoidal modeling of the speech linear prediction (LP) residual is performed within the general framework of matching pursuits with a dictionary of sinusoids. The frequency space of the sinusoids is restricted to sets of frequency intervals, or bins, which in conjunction with the closed-loop analysis allow the frequencies of the sinusoids to be mapped into a frequency vector that is efficiently quantized. In voiced frames, two sets of frequency vectors are generated: one representing harmonically related components of the voiced segment and the other nonharmonically related components. This approach eliminates the need for a voicing-dependent cutoff frequency, which is difficult to estimate correctly and to quantize at low bit rates. In transition frames, to efficiently extract and quantize the set of frequencies needed for the sinusoidal representation of the LP residual, we introduce frequency bin vector quantization (FBVQ). FBVQ selects a vector of nonuniformly spaced frequencies from a frequency codebook in order to represent the frequency-domain information in transition regions. Our use of FBVQ with closed-loop searching contributes to an improvement of speech quality in transition frames. The effectiveness of the coding scheme is enhanced by exploiting the critical-band concept of auditory perception in defining the frequency bins. To demonstrate the viability and advantages of the new models studied, we designed a 4 kbps matching pursuits sinusoidal speech coder. Subjective results indicate that the proposed coder at 4 kbps has quality exceeding that of the 6.3 kbps G.723.1 coder.
Pages: 413-424
Citations: 14
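The matching-pursuit loop itself is easy to sketch: at each pass, pick the frequency (restricted to an allowed grid) whose cosine/sine pair best fits the residual in the least-squares sense, record its parameters, and subtract. The grid and atom count below are illustrative, and the sketch omits the LP residual modeling and quantization stages.

```python
import numpy as np

def mp_sinusoids(signal, freq_grid, n_atoms=8):
    """Greedy matching pursuit over a sinusoid dictionary: each pass fits
    amplitude and phase (via a cos/sin pair) at every candidate frequency,
    keeps the one capturing the most residual energy, and subtracts it.

    freq_grid: candidate frequencies, normalized to the sample rate
    """
    r = signal.astype(float).copy()
    n = np.arange(len(r))
    params = []
    for _ in range(n_atoms):
        best = None
        for f in freq_grid:
            c, s = np.cos(2 * np.pi * f * n), np.sin(2 * np.pi * f * n)
            # Exact least-squares projection onto span{cos, sin}.
            gram = np.array([[c @ c, c @ s], [c @ s, s @ s]])
            a, b = np.linalg.solve(gram, np.array([r @ c, r @ s]))
            atom = a * c + b * s
            gain = r @ atom                  # energy captured by the atom
            if best is None or gain > best[0]:
                best = (gain, f, a, b, atom)
        _, f, a, b, atom = best
        params.append((f, a, b))
        r -= atom
    return params, r

grid = np.linspace(0.005, 0.45, 200)         # coarse stand-in for the bins
params, residual = mp_sinusoids(np.random.randn(160), grid)
```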
Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging
IEEE Trans. Speech Audio Process. Pub Date: 2003-08-26 DOI: 10.1109/TSA.2003.811544
I. Cohen
Abstract: Noise spectrum estimation is a fundamental component of speech enhancement and speech recognition systems. We present an improved minima controlled recursive averaging (IMCRA) approach for noise estimation in adverse environments involving nonstationary noise, weak speech components, and low input signal-to-noise ratio (SNR). The noise estimate is obtained by averaging past spectral power values, using a time-varying frequency-dependent smoothing parameter that is adjusted by the signal presence probability. The speech presence probability is controlled by the minima values of a smoothed periodogram. The proposed procedure comprises two iterations of smoothing and minimum tracking. The first iteration provides a rough voice activity detection in each frequency band. Smoothing in the second iteration then excludes relatively strong speech components, which makes the minimum tracking robust during speech activity. We show that in nonstationary noise environments and under low SNR conditions, the IMCRA approach is very effective. In particular, compared to a competitive method, it obtains a lower estimation error and, when integrated into a speech enhancement system, achieves improved speech quality and lower residual noise.
Pages: 466-475
Citations: 949
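A simplified single-iteration sketch of minima-controlled recursive averaging conveys the mechanism: track the minimum of a smoothed periodogram, declare speech where the smoothed power sits well above that minimum, and freeze the recursive noise update there. The hard 0/1 presence decision and all constants below are simplifications of IMCRA's soft, two-iteration procedure.

```python
import numpy as np

def mcra_noise_track(power_frames, alpha=0.92, beta=0.7, win=50, delta=4.0):
    """Minima-controlled recursive averaging, reduced to one iteration.

    power_frames: (n_frames, n_bins) noisy-speech periodograms
    Returns a per-frame noise PSD estimate.
    """
    s = power_frames[0].copy()               # smoothed periodogram
    noise = power_frames[0].copy()
    history = []
    out = np.empty_like(power_frames)
    for t in range(len(power_frames)):
        s = beta * s + (1 - beta) * power_frames[t]
        history.append(s.copy())
        if len(history) > win:
            history.pop(0)
        s_min = np.min(history, axis=0)      # running minimum per bin
        # Speech presence where smoothed power sits well above the
        # tracked minimum; a hard 0/1 decision for brevity.
        p_speech = (s > delta * s_min).astype(float)
        # Time-varying smoothing: freeze the noise update where speech
        # is likely present (alpha_t -> 1).
        alpha_t = alpha + (1 - alpha) * p_speech
        noise = alpha_t * noise + (1 - alpha_t) * power_frames[t]
        out[t] = noise
    return out

psd = np.abs(np.fft.rfft(np.random.randn(200, 256), axis=1)) ** 2
noise_est = mcra_noise_track(psd)
```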
Quantization of LSF parameters using a trellis modeling
IEEE Trans. Speech Audio Process. Pub Date: 2003-08-26 DOI: 10.1109/TSA.2003.814411
F. Lahouti, A. Khandani
Abstract: An efficient block-based trellis quantization (BTQ) scheme is proposed for the quantization of the line spectral frequencies (LSF) in speech coding applications. The scheme is based on modeling the LSF intraframe dependencies with a trellis structure. The ordering property and the fact that LSF parameters are bounded within a range are explicitly incorporated in the trellis model. BTQ search and design algorithms are discussed, and an efficient algorithm for index generation (finding the index of a path in the trellis) is presented. The sequential vector decorrelation technique is also presented to effectively exploit the intraframe correlation of LSF parameters within the trellis. Based on the proposed block-based trellis quantizer, two intraframe schemes and one interframe scheme are proposed. Comparisons are provided to the split-VQ, the trellis-coded quantization of LSF parameters, and the multi-stage VQ, as well as to the interframe schemes used in the IS-641 EFRC and the GSM AMR codec. These results demonstrate that the proposed BTQ schemes outperform the above systems.
Pages: 400-412
Citations: 12
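The ordering property maps naturally onto trellis transitions: a codeword for LSF i may only follow a strictly smaller codeword for LSF i-1, and a Viterbi-style dynamic program finds the cheapest ordered path. The sketch below uses independent scalar codebooks per coefficient for clarity; the paper's trellis and codebook design are more elaborate.

```python
import numpy as np

def trellis_lsf_quantize(lsf, codebooks):
    """Viterbi-style search over per-coefficient codebooks, with the LSF
    ordering constraint (each quantized value must exceed the previous
    one) built into the allowed trellis transitions.

    lsf:       ordered vector of line spectral frequencies
    codebooks: list of 1-D sorted arrays, one per LSF coefficient
    """
    n = len(lsf)
    cost = [(lsf[0] - codebooks[0]) ** 2]
    back = []
    for i in range(1, n):
        prev_cb, cb = codebooks[i - 1], codebooks[i]
        # allowed[j, k] is True when codeword k may follow codeword j.
        allowed = cb[None, :] > prev_cb[:, None]
        total = cost[-1][:, None] + (lsf[i] - cb[None, :]) ** 2
        total = np.where(allowed, total, np.inf)
        back.append(np.argmin(total, axis=0))
        cost.append(np.min(total, axis=0))
    # Trace back the cheapest ordered path through the trellis.
    idx = [int(np.argmin(cost[-1]))]
    for i in range(n - 2, -1, -1):
        idx.append(int(back[i][idx[-1]]))
    idx.reverse()
    return np.array([codebooks[i][k] for i, k in enumerate(idx)])

cbs = [np.sort(np.random.rand(8)) for _ in range(10)]
quantized = trellis_lsf_quantize(np.sort(np.random.rand(10)), cbs)
```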
Efficient text-independent speaker verification with structural Gaussian mixture models and neural network
IEEE Trans. Speech Audio Process. Pub Date: 2003-08-26 DOI: 10.1109/TSA.2003.815822
Bing Xiang, T. Berger
Abstract: We present an integrated system with structural Gaussian mixture models (SGMMs) and a neural network that achieves both computational efficiency and high accuracy in text-independent speaker verification. A structural background model (SBM) is first constructed by hierarchically clustering all Gaussian mixture components in a universal background model (UBM). In this way, the acoustic space is partitioned into multiple regions at different levels of resolution. For each target speaker, an SGMM can be generated through multilevel maximum a posteriori (MAP) adaptation from the SBM. During testing, only a small subset of Gaussian mixture components is scored for each feature vector, which significantly reduces the computational cost. Furthermore, the scores obtained in different layers of the tree-structured models are combined via a neural network for the final decision. Different configurations are compared in experiments conducted on the telephony speech data used in the NIST speaker verification evaluation. The experimental results show that a computational reduction by a factor of 17 can be achieved with a 5% relative reduction in equal error rate (EER) compared with the baseline. The SGMM-SBM also shows some advantages over the recently proposed hash GMM, including higher speed and better verification performance.
Pages: 447-456
Citations: 120
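The computational saving comes from scoring only the components under the best-matching clusters. A two-level version of that idea is sketched below, with fixed-size groups and simple centroid ranking standing in for the hierarchical SBM; the multilevel MAP adaptation and neural-network score combination are omitted.

```python
import numpy as np

def log_gauss(x, means, variances):
    """Diagonal-covariance Gaussian log-densities of x under each row."""
    return -0.5 * np.sum(np.log(2 * np.pi * variances)
                         + (x - means) ** 2 / variances, axis=1)

def fast_ubm_score(x, means, variances, weights,
                   group_of, group_means, top_groups=2):
    """Two-level structural scoring: rank coarse clusters of UBM
    components first, then evaluate only components in the best clusters."""
    centroid_scores = -np.sum((x - group_means) ** 2, axis=1)
    best = np.argsort(centroid_scores)[-top_groups:]
    mask = np.isin(group_of, best)
    ll = log_gauss(x, means[mask], variances[mask]) + np.log(weights[mask])
    return np.logaddexp.reduce(ll)   # log p(x) over the selected subset

# Toy setup: 64 UBM components in 8 clusters of 8 (sizes illustrative).
rng = np.random.default_rng(0)
means = rng.normal(size=(64, 12))
variances = np.ones((64, 12))
weights = np.full(64, 1 / 64)
group_of = np.repeat(np.arange(8), 8)
group_means = np.array([means[group_of == g].mean(axis=0) for g in range(8)])
score = fast_ubm_score(rng.normal(size=12), means, variances, weights,
                       group_of, group_means)
```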
A soft voice activity detector based on a Laplacian-Gaussian model
IEEE Trans. Speech Audio Process. Pub Date: 2003-08-26 DOI: 10.1109/TSA.2003.815518
S. Gazor, Wei Zhang
Abstract: A new voice activity detector (VAD) is developed in this paper. The VAD is derived by applying a Bayesian hypothesis test on decorrelated speech samples. The signal is first decorrelated using an orthogonal transformation, e.g., the discrete cosine transform (DCT) or the adaptive Karhunen-Loeve transform (KLT). As recent investigations suggest, the distributions of clean speech and noise signals are assumed to be Laplacian and Gaussian, respectively. In addition, a hidden Markov model (HMM) with two states representing silence and speech is employed. The proposed soft VAD recursively estimates the probability of voice being active (VBA). To this end, the a priori probability of VBA is first predicted from feedback information at the previous time instant; the predicted probability is then updated with the new observation to give the probability of VBA at the current time instant. The required parameters of both the speech and noise signals are estimated adaptively by the maximum likelihood (ML) approach. Simulation results show that the proposed soft VAD, which uses a Laplacian distribution model for speech signals, outperforms the previous VAD based on a Gaussian model.
Pages: 498-505
Citations: 157
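The recursive estimate has a clean prediction-update form: propagate the previous posterior through the HMM transition probabilities, then apply Bayes' rule with the Laplacian (speech) and Gaussian (noise) frame likelihoods. The sketch below assumes known, fixed model parameters rather than the paper's adaptive ML estimates.

```python
import numpy as np

def soft_vad(frames, sigma_n, b_s, a01=0.1, a10=0.1):
    """Soft VAD on decorrelated coefficients: Laplacian speech model,
    Gaussian noise model, two-state HMM smoothing of the posterior.

    frames:  (n_frames, n_coeffs) decorrelated (e.g., DCT) coefficients
    sigma_n: noise standard deviation per coefficient
    b_s:     Laplacian scale of the speech model per coefficient
    a01/a10: silence->speech and speech->silence transition probabilities
    """
    p = 0.5                                  # P(speech) at previous frame
    probs = []
    for x in frames:
        # Frame log-likelihoods under the two hypotheses.
        ll_noise = np.sum(-0.5 * np.log(2 * np.pi * sigma_n ** 2)
                          - x ** 2 / (2 * sigma_n ** 2))
        ll_speech = np.sum(-np.log(2 * b_s) - np.abs(x) / b_s)
        # Prediction step through the HMM transitions.
        prior = p * (1 - a10) + (1 - p) * a01
        # Update step (Bayes' rule), numerically clipped for safety.
        r = np.exp(np.clip(ll_speech - ll_noise, -50, 50))
        p = prior * r / (prior * r + (1 - prior))
        probs.append(p)
    return np.array(probs)

vba = soft_vad(np.random.randn(100, 32), sigma_n=1.0, b_s=1.2)
```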