IEEE Transactions on Audio Speech and Language Processing: Latest Articles

A High-Quality Speech and Audio Codec With Less Than 10-ms Delay
IEEE Transactions on Audio Speech and Language Processing. Pub Date: 2016-02-17. DOI: 10.1109/TASL.2009.2023186
J. Valin, Timothy B. Terriberry, Christopher Montgomery, Gregory Maxwell
Abstract: With increasing quality requirements for multimedia communications, audio codecs must maintain both high quality and low delay. Typically, audio codecs offer either low delay or high quality, but rarely both. We propose a codec that simultaneously addresses both these requirements, with a delay of only 8.7 ms at 44.1 kHz. It uses gain-shape algebraic vector quantization in the frequency domain with time-domain pitch prediction. We demonstrate that the proposed codec operating at 48 kb/s and 64 kb/s outperforms both G.722.1C and MP3 and has quality comparable to AAC-LD, despite having less than one fourth of the algorithmic delay of these codecs.
Citations: 64
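
The codec's core quantizer codes each band's gain and shape separately, with the shape represented by a small number of signed unit pulses. Below is a minimal Python sketch of that gain-shape idea using a greedy pulse search; the pulse count K, the uniform gain grid, and the band size are illustrative assumptions, not the codec's actual algebraic codebook or bit allocation.

```python
import numpy as np

def gain_shape_quantize(band, K=8, gain_step=0.25):
    """Toy gain-shape quantizer for one frequency band: scalar-quantize the
    gain, then approximate the unit-norm shape with K signed unit pulses
    found by a greedy correlation search (illustrative parameters only)."""
    gain = np.linalg.norm(band)
    q_gain = np.round(gain / gain_step) * gain_step      # uniform gain grid (assumption)
    shape = band / (gain + 1e-12)

    y = np.zeros_like(shape)
    for _ in range(K):                                   # place one pulse at a time
        best_i, best_corr = 0, -np.inf
        for i in range(len(shape)):
            cand = y.copy()
            cand[i] += 1.0 if shape[i] >= 0 else -1.0
            corr = cand @ shape / np.linalg.norm(cand)   # cosine with the target shape
            if corr > best_corr:
                best_i, best_corr = i, corr
        y[best_i] += 1.0 if shape[best_i] >= 0 else -1.0

    return q_gain * y / np.linalg.norm(y)                # reconstructed band

band = np.random.randn(16)                               # stand-in for one MDCT band
print(gain_shape_quantize(band))
```
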
Dominance Based Integration of Spatial and Spectral Features for Speech Enhancement
IEEE Transactions on Audio Speech and Language Processing. Pub Date: 2013-12-01. DOI: 10.1109/TASL.2013.2277937
T. Nakatani, S. Araki, Takuya Yoshioka, Marc Delcroix, M. Fujimoto
Abstract: This paper proposes a versatile technique for integrating two conventional speech enhancement approaches, a spatial clustering approach (SCA) and a factorial model approach (FMA), which are based on two different features of signals, namely spatial and spectral features, respectively. When used separately, the conventional approaches simply identify time-frequency (TF) bins that are dominated by interference for speech enhancement. Integration of the two approaches makes the identification more reliable, and allows us to estimate speech spectra more accurately even in highly nonstationary interference environments. This paper also proposes extensions of the FMA for further elaboration of the proposed technique, including one that uses spectral models based on mel-frequency cepstral coefficients and another to cope with mismatches, such as channel mismatches, between captured signals and the spectral models. Experiments using simulated and real recordings show that the proposed technique can effectively improve audible speech quality and the automatic speech recognition score.
Citations: 36
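
The integration hinges on deciding, per time-frequency bin, whether speech or interference dominates, using both feature streams. Here is a minimal sketch of one plausible fusion rule: two per-bin speech-dominance probabilities combined under a naive independence assumption. This product rule is an assumption for illustration; the paper's actual integration is a joint probabilistic model.

```python
import numpy as np

def fuse_dominance(p_spatial, p_spectral, eps=1e-6):
    """Combine per-bin speech-dominance probabilities from a spatial model
    and a spectral model, assuming the two cues are independent."""
    p1 = np.clip(p_spatial, eps, 1.0 - eps)
    p2 = np.clip(p_spectral, eps, 1.0 - eps)
    return p1 * p2 / (p1 * p2 + (1.0 - p1) * (1.0 - p2))

# toy example: a noisy STFT magnitude masked by the fused dominance estimate
X = np.abs(np.random.randn(257, 100))            # |STFT| of the mixture
mask = fuse_dominance(np.random.rand(257, 100),  # stand-in spatial probabilities
                      np.random.rand(257, 100))  # stand-in spectral probabilities
S_hat = mask * X                                 # interference-dominated bins attenuated
```
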
Linearly-Constrained Minimum-Variance Method for Spherical Microphone Arrays Based on Plane-Wave Decomposition of the Sound Field
IEEE Transactions on Audio Speech and Language Processing. Pub Date: 2013-12-01. DOI: 10.1109/TASL.2013.2277939
Yotam Peled, B. Rafaely
Abstract: Speech signals recorded in real environments may be corrupted by ambient noise and reverberation. Therefore, noise reduction and dereverberation algorithms for speech enhancement are typically employed in speech communication systems. Although microphone arrays are useful in reducing the effect of noise and reverberation, existing methods have limited success in significantly removing both reverberation and noise in real environments. This paper presents a method for noise reduction and dereverberation that overcomes some of the limitations of previous methods. The method uses a spherical microphone array to achieve plane-wave decomposition (PWD) of the sound field, based on direction-of-arrival (DOA) estimation of the desired signal and its reflections. A multi-channel linearly-constrained minimum-variance (LCMV) filter is introduced to achieve further noise reduction. The PWD beamformer achieves dereverberation while the LCMV filter reduces the uncorrelated noise with a controllable dereverberation constraint. In contrast to other methods, the proposed method employs DOA estimation, rather than room impulse response identification, to achieve dereverberation, and relative transfer function (RTF) estimation between the source reflections to achieve noise reduction while avoiding signal cancellation. The paper includes a simulation investigation and an experimental study, comparing the proposed method to currently available methods.
Citations: 28
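
The noise-reduction stage is a standard multi-channel LCMV filter, whose closed form is w = R⁻¹C(CᴴR⁻¹C)⁻¹f. A minimal sketch of that closed form follows, with random stand-ins for the noise covariance and for the steering vectors of the source and one reflection (the plane-wave-domain channel model here is an assumption for illustration):

```python
import numpy as np

def lcmv_weights(R, C, f):
    """Closed-form LCMV solution w = R^{-1} C (C^H R^{-1} C)^{-1} f,
    minimizing output noise power subject to C^H w = f."""
    Rinv_C = np.linalg.solve(R, C)
    return Rinv_C @ np.linalg.solve(C.conj().T @ Rinv_C, f)

rng = np.random.default_rng(0)
M = 8                                          # channels (e.g. plane-wave signals)
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
R = A @ A.conj().T + np.eye(M)                 # Hermitian positive-definite noise covariance
C = rng.standard_normal((M, 2)) + 1j * rng.standard_normal((M, 2))  # source + one reflection
f = np.array([1.0, 0.5])                       # pass the source, keep the reflection at -6 dB
w = lcmv_weights(R, C, f)
print(np.allclose(C.conj().T @ w, f))          # True: constraints are met
```

The second entry of f plays the role of the paper's controllable dereverberation constraint: setting it below one attenuates, rather than nulls, the reflection.
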
A Class of Optimal Rectangular Filtering Matrices for Single-Channel Signal Enhancement in the Time Domain
IEEE Transactions on Audio Speech and Language Processing. Pub Date: 2013-12-01. DOI: 10.1109/TASL.2013.2280215
J. Jensen, J. Benesty, M. G. Christensen, Jingdong Chen
Abstract: In this paper, we introduce a new class of optimal rectangular filtering matrices for single-channel speech enhancement. The new class of filters exploits the fact that the dimension of the signal subspace is lower than that of the full space. By doing this, extra degrees of freedom in the filters that are otherwise reserved for preserving the signal subspace can be used for achieving an improved output signal-to-noise ratio (SNR). Moreover, the filters allow for explicit control of the tradeoff between noise reduction and speech distortion via the chosen rank of the signal subspace. An interesting aspect is that the framework in which the filters are derived unifies the ideas of optimal filtering and subspace methods. A number of different optimal filter designs are derived in this framework, and their properties and performance are studied using both synthetic, periodic signals and real signals. The results show a number of interesting things. Firstly, they show how speech distortion can be traded for noise reduction and vice versa in a seamless manner. Moreover, the introduced filter designs are capable of achieving both the upper and lower bounds for the output SNR via the choice of a single parameter.
Citations: 10
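
The rank-versus-distortion tradeoff can be illustrated with a square, rank-constrained cousin of the paper's rectangular filters: keep only the top-r eigendirections of the estimated desired-signal covariance and Wiener-weight them, so that r controls distortion versus noise reduction. This is a sketch under those simplifying assumptions, not the paper's exact constructions.

```python
import numpy as np

def rank_r_filter(Ry, Rv, r):
    """Rank-constrained filtering matrix: estimate the desired-signal
    covariance as Ry - Rv, keep its r strongest eigendirections, and apply
    Wiener-like gains along them. Smaller r -> more noise reduction but
    more signal distortion."""
    w, U = np.linalg.eigh(Ry - Rv)
    order = np.argsort(w)[::-1][:r]
    Ur = U[:, order]
    lam = np.clip(w[order], 0.0, None)               # signal power per kept direction
    nv = np.einsum('ij,jk,ki->i', Ur.T, Rv, Ur)      # noise power per kept direction
    return Ur @ np.diag(lam / (lam + nv)) @ Ur.T

L, sigma2 = 32, 0.1
t = np.arange(4000)
y = np.sin(2 * np.pi * 0.05 * t) + np.sqrt(sigma2) * np.random.randn(t.size)
frames = np.lib.stride_tricks.sliding_window_view(y, L)
Ry = frames.T @ frames / frames.shape[0]             # noisy-signal covariance estimate
H = rank_r_filter(Ry, sigma2 * np.eye(L), r=4)       # white-noise covariance assumed known
x_hat = frames @ H.T                                 # filter every frame
```
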
Investigations on an EM-Style Optimization Algorithm for Discriminative Training of HMMs
IEEE Transactions on Audio Speech and Language Processing. Pub Date: 2013-12-01. DOI: 10.1109/TASL.2013.2280234
G. Heigold, H. Ney, R. Schlüter
Abstract: Today's speech recognition systems are based on hidden Markov models (HMMs) with Gaussian mixture models whose parameters are estimated using a discriminative training criterion such as Maximum Mutual Information (MMI) or Minimum Phone Error (MPE). Currently, the optimization is almost always done with (empirical variants of) Extended Baum-Welch (EBW). This type of optimization requires sophisticated update schemes for the step sizes and a considerable amount of parameter tuning, and little is known about its convergence behavior. In this paper, we derive an EM-style algorithm for discriminative training of HMMs. Like Expectation-Maximization (EM) for the generative training of HMMs, the proposed algorithm improves the training criterion on each iteration, converges to a local optimum, and is completely parameter-free. We investigate the feasibility of the proposed EM-style algorithm for discriminative training on two tasks, namely grapheme-to-phoneme conversion and spoken digit string recognition.
Citations: 8
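
For contrast with the parameter-free algorithm investigated here, the standard EBW mean update that serves as the baseline makes the tuning problem explicit: the smoothing constant D acts as an inverse step size and must be chosen heuristically. A sketch of that baseline update for a single Gaussian mean, with scalar toy statistics:

```python
def ebw_mean_update(mu, num_sum, num_cnt, den_sum, den_cnt, D):
    """Extended Baum-Welch re-estimation of a Gaussian mean from MMI
    numerator (reference) and denominator (competitor) statistics:
        mu' = (num_sum - den_sum + D * mu) / (num_cnt - den_cnt + D)
    D must be large enough to keep the update stable; choosing it is the
    empirical tuning that an EM-style, parameter-free algorithm avoids."""
    return (num_sum - den_sum + D * mu) / (num_cnt - den_cnt + D)

# toy statistics: the numerator pulls the mean up, the denominator pulls it down
for D in (1.0, 10.0, 100.0):                  # larger D -> smaller, safer steps
    print(D, ebw_mean_update(mu=1.0, num_sum=12.0, num_cnt=10.0,
                             den_sum=9.0, den_cnt=10.0, D=D))
```
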
Soundfield Imaging in the Ray Space
IEEE Transactions on Audio Speech and Language Processing. Pub Date: 2013-12-01. DOI: 10.1109/TASL.2013.2274697
D. Markovic, F. Antonacci, A. Sarti, S. Tubaro
Abstract: In this work we propose a general approach to acoustic scene analysis based on a novel data structure (ray-space image) that encodes the directional plenacoustic function over a line segment (Observation Window, OW). We define and describe a system for acquiring a ray-space image using a microphone array and refer to it as a ray-space (or "soundfield") camera. The method consists of acquiring the pseudo-spectra corresponding to a grid of sampling points over the OW, and remapping them onto the ray space, which parameterizes acoustic paths crossing the OW. The resulting ray-space image displays the information gathered by the sensors in such a way that the elements of the acoustic scene (sources and reflectors) are easy to discern, recognize and extract. The key advantage of this method is that ray-space images, irrespective of the application, are generated by a common (and highly parallelizable) processing layer, and can be processed using methods from the extensive literature of pattern analysis. After defining the ideal ray-space image in terms of the directional plenacoustic function, we show how to acquire it using a microphone array. We also discuss resolution and aliasing issues and show two simple examples of applications of ray-space imaging.
Citations: 33
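
One minimal way to picture the data structure: slide a short subarray along the observation window and compute, for each position, a narrowband steered-response pseudo-spectrum over direction; stacking these gives a (position, direction) image in which a point source traces a characteristic pattern. The sketch below works under those assumptions (single frequency, uniform line array, position/angle parameterization); the paper's acquisition and ray-space remapping are more general.

```python
import numpy as np

def ray_space_image(X, mic_x, freq, sub_len=4, n_dir=61, c=343.0):
    """Stack steered-response pseudo-spectra of a sliding subarray into a
    (window position) x (direction) image for one frequency bin."""
    dirs = np.linspace(-np.pi / 2, np.pi / 2, n_dir)
    img = np.zeros((len(mic_x) - sub_len + 1, n_dir))
    for p in range(img.shape[0]):
        xs = X[p:p + sub_len]
        pos = mic_x[p:p + sub_len] - mic_x[p:p + sub_len].mean()
        for d, th in enumerate(dirs):
            a = np.exp(-2j * np.pi * freq * pos * np.sin(th) / c)  # steering vector
            img[p, d] = np.abs(a.conj() @ xs) ** 2 / sub_len       # pseudo-spectrum
    return img, dirs

# toy snapshot: a single plane wave from 20 degrees on a 16-mic line array
mic_x = 0.04 * np.arange(16)
freq = 2000.0
X = np.exp(-2j * np.pi * freq * mic_x * np.sin(np.deg2rad(20)) / 343.0)
img, dirs = ray_space_image(X, mic_x, freq)
print(np.rad2deg(dirs[img.sum(axis=0).argmax()]))  # close to 20 (grid-limited)
```
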
Epoch Extraction Based on Integrated Linear Prediction Residual Using Plosion Index
IEEE Transactions on Audio Speech and Language Processing. Pub Date: 2013-12-01. DOI: 10.1109/TASL.2013.2273717
A. Prathosh, T. Ananthapadmanabha, A. Ramakrishnan
Abstract: Epoch is defined as the instant of significant excitation within a pitch period of voiced speech. Epoch extraction continues to attract the interest of researchers because of its significance in speech analysis. Existing high-performance epoch extraction algorithms require either dynamic programming techniques or a priori information of the average pitch period. An algorithm without such requirements is proposed based on the integrated linear prediction residual (ILPR), which resembles the voice source signal. The half-wave rectified and negated ILPR (or the Hilbert transform of the ILPR) is used as the pre-processed signal. A new non-linear temporal measure named the plosion index (PI) has been proposed for detecting 'transients' in the speech signal. An extension of the PI, called the dynamic plosion index (DPI), is applied to the pre-processed signal to estimate the epochs. The proposed DPI algorithm is validated using six large databases which provide simultaneous EGG recordings. Creaky and singing voice samples are also analyzed. The algorithm has been tested for its robustness in the presence of additive white and babble noise and on simulated telephone-quality speech. The performance of the DPI algorithm is found to be comparable or better than that of five state-of-the-art techniques for the experiments considered.
Citations: 123
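
The plosion index is a simple temporal measure: the magnitude at a candidate instant divided by the average magnitude over a stretch of preceding samples, so that impulse-like excitations stand out. A sketch of that measure on a synthetic impulse train follows; the offsets m1 and m2 and the threshold are illustrative choices, and the paper's pre-processing of the ILPR and the dynamic extension (DPI) are only loosely reflected here.

```python
import numpy as np

def plosion_index(x, n, m1=2, m2=20):
    """Plosion index at sample n: |x[n]| divided by the mean |x| over the
    m2 samples ending m1 samples earlier (illustrative offsets)."""
    past = np.abs(x[n - m1 - m2:n - m1])
    return np.abs(x[n]) / (past.mean() + 1e-12)

# synthetic 'residual': impulses every 80 samples in weak noise
x = 0.05 * np.random.randn(800)
x[80::80] += 1.0
pi = np.array([plosion_index(x, n) for n in range(30, len(x))])
epochs = 30 + np.flatnonzero(pi > 10.0)       # threshold picks the impulse locations
print(epochs[:5])
```
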
Body Conducted Speech Enhancement by Equalization and Signal Fusion
IEEE Transactions on Audio Speech and Language Processing. Pub Date: 2013-12-01. DOI: 10.1109/TASL.2013.2274696
Tomas Dekens, W. Verhelst
Abstract: This paper studies body-conducted speech for noise-robust speech processing purposes. As body-conducted speech is typically limited in bandwidth, signal processing is required to obtain a signal that is both high in quality and low in noise. We propose an algorithm that first equalizes the body-conducted speech using filters obtained from a pre-defined filter set and subsequently fuses this equalized signal with a noisy conventional microphone signal using an optimal clean-speech amplitude and phase estimator. We evaluated the proposed equalization and fusion technique using a combination of a conventional close-talk and a throat microphone. Subjective listening tests show that the proposed method successfully fuses the speech quality of the conventional signal and the noise robustness of the throat microphone signal. The listening tests also indicate that the inclusion of the body-conducted signal can improve single-channel speech enhancement methods, while a calculated set of objective signal quality measures confirms these observations.
Citations: 14
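
A compact way to picture the fusion step: per STFT bin, trust the air-conducted microphone where its estimated SNR is high and fall back to the equalized body-conducted spectrum where it is low. The sketch below uses a plain Wiener gain as the weighting, which is an assumption for illustration; the paper instead uses an optimal clean-speech amplitude and phase estimator, and its equalizer selects filters from a pre-defined set.

```python
import numpy as np

def fuse_bins(air, bc_eq, noise_psd, eps=1e-12):
    """SNR-weighted per-bin fusion of an air-conducted STFT frame with an
    equalized body-conducted one (a simple stand-in for the paper's
    estimator-based fusion)."""
    snr = np.maximum(np.abs(air) ** 2 / (noise_psd + eps) - 1.0, 0.0)
    g = snr / (snr + 1.0)              # Wiener-style confidence in the air channel
    return g * air + (1.0 - g) * bc_eq

# toy frames: the air channel is noisy, the body-conducted one clean but band-limited
air = np.random.randn(257) + 1j * np.random.randn(257)
bc_eq = (0.5 + 0j) * np.random.randn(257)   # stand-in equalized BC spectrum
noise_psd = np.full(257, 0.5)
print(fuse_bins(air, bc_eq, noise_psd)[:4])
```
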
A Bag of Systems Representation for Music Auto-Tagging
IEEE Transactions on Audio Speech and Language Processing. Pub Date: 2013-12-01. DOI: 10.1109/TASL.2013.2279318
Katherine Ellis, E. Coviello, Antoni B. Chan, Gert R. G. Lanckriet
Abstract: We present a content-based automatic tagging system for music that relies on a high-level, concise "Bag of Systems" (BoS) representation of the characteristics of a musical piece. The BoS representation leverages a rich dictionary of musical codewords, where each codeword is a generative model that captures timbral and temporal characteristics of music. Songs are represented as a BoS histogram over codewords, which allows for the use of traditional algorithms for text document retrieval to perform auto-tagging. Compared to estimating a single generative model to directly capture the musical characteristics of songs associated with a tag, the BoS approach offers the flexibility to combine different generative models at various time resolutions through the selection of the BoS codewords. Additionally, decoupling the modeling of audio characteristics from the modeling of tag-specific patterns makes BoS a more robust and rich representation of music. Experiments show that this leads to superior auto-tagging performance.
Citations: 22
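
The pipeline can be sketched with diagonal Gaussians standing in for the paper's richer codeword models: fix a dictionary of generative models, let each song fragment vote for its maximum-likelihood codeword, and use the normalized histogram as the song's feature for downstream tag models. A minimal sketch under that simplification:

```python
import numpy as np

def bos_histogram(fragments, means, variances):
    """Bag-of-Systems histogram: each fragment (a frames x dims feature
    matrix) is assigned to the diagonal-Gaussian 'codeword' under which it
    is most likely; the normalized counts describe the song."""
    counts = np.zeros(len(means))
    for f in fragments:
        ll = [np.sum(-0.5 * ((f - m) ** 2 / v + np.log(2 * np.pi * v)))
              for m, v in zip(means, variances)]
        counts[int(np.argmax(ll))] += 1
    return counts / counts.sum()

# toy dictionary of 3 codewords over 2-D features, and one 'song' of 5 fragments
means = [np.zeros(2), np.ones(2), -np.ones(2)]
variances = [np.ones(2), np.ones(2), np.ones(2)]
song = [np.random.randn(10, 2) + means[i % 3] for i in range(5)]
print(bos_histogram(song, means, variances))
```

The resulting histogram plays the same role as a term histogram in text retrieval, which is what lets standard document-retrieval machinery perform the tagging.
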
Geometry-Based Spatial Sound Acquisition Using Distributed Microphone Arrays
IEEE Transactions on Audio Speech and Language Processing. Pub Date: 2013-12-01. DOI: 10.1109/TASL.2013.2280210
O. Thiergart, G. D. Galdo, Maja Taseska, Emanuël Habets
Abstract: Traditional spatial sound acquisition aims at capturing a sound field with multiple microphones such that at the reproduction side a listener can perceive the sound image as it was at the recording location. Standard techniques for spatial sound acquisition usually use spaced omnidirectional microphones or coincident directional microphones. Alternatively, microphone arrays and spatial filters can be used to capture the sound field. From a geometric point of view, the perspective of the sound field is fixed when using such techniques. In this paper, a geometry-based spatial sound acquisition technique is proposed to compute virtual microphone signals that manifest a different perspective of the sound field. The proposed technique uses a parametric sound field model that is formulated in the time-frequency domain. It is assumed that each time-frequency instant of a microphone signal can be decomposed into one direct and one diffuse sound component. It is further assumed that the direct component is the response of a single isotropic point-like source (IPLS) whose position is estimated for each time-frequency instant using distributed microphone arrays. Given the sound components and the position of the IPLS, it is possible to synthesize a signal that corresponds to a virtual microphone at an arbitrary position and with an arbitrary pick-up pattern.
Citations: 45
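
Once the direct/diffuse decomposition and the IPLS position are available, the synthesis step reduces, per time-frequency bin, to simple geometry: re-scale the direct sound for the virtual microphone's distance and pickup pattern, then add the diffuse part. Here is a sketch with a cardioid pattern and 1/r amplitude scaling; phase propagation and the paper's parameter estimators are omitted, so this is an illustrative model only.

```python
import numpy as np

def virtual_mic_bin(S_dir, S_diff, ipls_pos, vm_pos, vm_look, ref_dist=1.0):
    """One TF bin of a virtual cardioid microphone: distance-scaled,
    pattern-weighted direct sound from the estimated IPLS position, plus
    the diffuse component unchanged (illustrative model)."""
    v = ipls_pos - vm_pos
    r = np.linalg.norm(v)
    pattern = 0.5 + 0.5 * float(v @ vm_look) / (r + 1e-12)  # cardioid gain
    return pattern * (ref_dist / max(r, 1e-3)) * S_dir + S_diff

ipls = np.array([2.0, 1.0])    # estimated source position for this bin
vm = np.array([0.0, 0.0])      # virtual microphone position
look = np.array([1.0, 0.0])    # virtual microphone look direction (unit vector)
print(virtual_mic_bin(1.0 + 0.5j, 0.1 + 0.0j, ipls, vm, look))
```
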