Eurasip Journal on Audio Speech and Music Processing: Latest Articles

Singing to speech conversion with generative flow.

IF 1.7, Region 3 (Computer Science)

Eurasip Journal on Audio Speech and Music Processing Pub Date : 2025-01-01 Epub Date: 2025-03-10 DOI: 10.1186/s13636-025-00400-x

Jiawen Huang, Emmanouil Benetos

Citations: 0

Robust and early howling detection based on a sparsity measure.

IF 1.7, Region 3 (Computer Science)

Eurasip Journal on Audio Speech and Music Processing Pub Date : 2025-01-01 Epub Date: 2025-03-27 DOI: 10.1186/s13636-025-00399-1

Mina Mounir, Giuliano Bernardi, Toon van Waterschoot

Abstract: Despite recent advances in audio technology, acoustic feedback remains a problem encountered in many sound reinforcement applications, ranging from public address systems to hearing aids. Acoustic feedback occurs due to the acoustic coupling between a loudspeaker and microphone, creating a closed-loop system that may become unstable and produce an acoustic artifact referred to as howling. One solution to the acoustic feedback problem, known as notch-filter-based howling suppression (NHS), consists of detecting and suppressing howling components, thereby stabilizing the closed-loop system and removing audible howling artifacts. The key component of any NHS method is howling detection (HD), which is typically based on the calculation of temporal and/or spectral features that make it possible to discriminate howling from desired audio signal components. In this paper, three contributions to HD research are presented. Firstly, we propose a novel howling detection feature, termed NINOS²-Transposed (NINOS²-T), that exploits the particular time-frequency structure of a howling artifact. The NINOS²-T feature is shown to outperform common state-of-the-art HD features, to be more robust to detection threshold variations, and to allow for the detection of early howling and ringing by discarding the often used concept of howling candidate selection. Secondly, a new annotated dataset for HD research is introduced which is significantly larger and more diverse than existing datasets containing realistic howling artifacts. Thirdly, a new HD performance evaluation procedure is proposed that is suitable when using HD features that do not rely on howling candidate selection. This procedure opens the door for the evaluation of early howling and ringing detection performance and can handle the high class imbalance inherent in the HD problem by using precision-recall (PR) instead of receiver operating characteristic (ROC) curves.

Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11950036/pdf/
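
The abstract's preference for precision-recall over ROC curves under class imbalance can be illustrated with a minimal sketch. The confusion counts below are made-up assumptions chosen for illustration, not results from the paper, and `detection_rates` is a hypothetical helper name.

```python
def detection_rates(tp, fp, fn, tn):
    """Return (precision, recall, false_positive_rate) from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return precision, recall, fpr


# Hypothetical imbalanced case: 10 howling frames among 1000 non-howling frames.
# The detector finds 8 of the 10 howling frames but also raises 40 false alarms.
precision, recall, fpr = detection_rates(tp=8, fp=40, fn=2, tn=960)
print(f"recall={recall:.2f}  fpr={fpr:.2f}  precision={precision:.3f}")
```

In this example the ROC operating point (false positive rate 0.04, recall 0.8) looks strong, while the precision shows that only one in six detections is actual howling. This is the kind of gap a PR curve exposes when negatives vastly outnumber positives.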

Citations: 0

Compression of room impulse responses for compact storage and fast low-latency convolution

IF 2.4, Region 3 (Computer Science)

Eurasip Journal on Audio Speech and Music Processing Pub Date : 2024-09-13 DOI: 10.1186/s13636-024-00363-5

Martin Jälmby, Filip Elvander, Toon van Waterschoot

Abstract: Room impulse responses (RIRs) are used in several applications, such as augmented reality and virtual reality. These applications require a large number of RIRs to be convolved with audio, under strict latency constraints. In this paper, we consider the compression of RIRs, in conjunction with fast time-domain convolution. We consider three different methods of RIR approximation for the purpose of RIR compression and compare them to state-of-the-art compression. The methods are evaluated using several standard objective quality measures, both channel-based and signal-based. We also propose a novel low-rank-based algorithm for fast time-domain convolution and show how the convolution can be carried out without the need to decompress the RIR. Numerical simulations are performed using RIRs of different lengths, recorded in three different rooms. It is shown that compression using low-rank approximation is a compelling alternative to state-of-the-art Opus compression, as it performs as well as or better than Opus on all but one of the considered measures, with the added benefit of being amenable to fast time-domain convolution.
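
The idea of convolving with a low-rank RIR without decompressing it can be sketched as follows. This is a generic illustration under stated assumptions, not necessarily the algorithm proposed in the paper: the RIR is assumed to be reshaped column-wise into an `n1 x n2` matrix and truncated by SVD, and the function names `low_rank_factors` and `low_rank_convolve` are hypothetical.

```python
import numpy as np


def low_rank_factors(h, n1, rank):
    """Reshape an RIR into an n1 x n2 matrix (column-major) and truncate its SVD.

    Returns A (n1 x rank) and B (rank x n2) such that A @ B approximates the
    matrix H with H[i, j] = h[i + j * n1].
    """
    n2 = len(h) // n1
    H = h[: n1 * n2].reshape(n1, n2, order="F")
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank, :]


def low_rank_convolve(x, A, B):
    """Convolve x with the RIR represented by A @ B without rebuilding the RIR.

    Each rank-1 term equals a short filter A[:, k] convolved with a sparse
    filter whose taps B[k, j] sit every n1 samples, so the full convolution
    splits into two cheap stages per term.
    """
    n1, rank = A.shape
    n2 = B.shape[1]
    y = np.zeros(len(x) + n1 * n2 - 1)
    for k in range(rank):
        short = np.convolve(x, A[:, k])  # stage 1: short dense filter
        for j in range(n2):              # stage 2: sparse filter, one tap per block
            y[j * n1 : j * n1 + len(short)] += B[k, j] * short
    return y
```

Under these assumptions the cost per output sample is on the order of `rank * (n1 + n2)` multiplications instead of `n1 * n2` for direct convolution with the full-length RIR, which is where the saving comes from when the rank is small.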

Citations: 0

Guest editorial: AI for computational audition—sound and music processing

IF 2.4, Region 3 (Computer Science)

Eurasip Journal on Audio Speech and Music Processing Pub Date : 2024-09-11 DOI: 10.1186/s13636-024-00353-7

Zijin Li, Wenwu Wang, Kejun Zhang, Mengyao Zhu

Citations: 0

Physics-constrained adaptive kernel interpolation for region-to-region acoustic transfer function: a Bayesian approach

IF 2.4, Region 3 (Computer Science)

Eurasip Journal on Audio Speech and Music Processing Pub Date : 2024-09-10 DOI: 10.1186/s13636-024-00362-6

Juliano G. C. Ribeiro, Shoichi Koyama, Hiroshi Saruwatari

Abstract: A kernel interpolation method for the acoustic transfer function (ATF) between regions, constrained by the physics of sound while being adaptive to the data, is proposed. Most ATF interpolation methods aim to model the ATF for a fixed source by using techniques that fit the estimation to the measurements while not taking the physics of the problem into consideration. We aim to interpolate the ATF for a region-to-region estimation, meaning we account for variation of both source and receiver positions. By using a very general formulation for the reproducing kernel function, we have created a kernel function that considers both directed and residual fields as two separate kernel functions. The directed field kernel considers a sparse selection of reflective field components with large amplitudes and is formulated as a combination of directional kernels. The residual field is composed of the remaining densely distributed components with lower amplitudes. Its kernel weight is represented by a universal approximator, a neural network, in order to learn patterns from the data freely. These kernel parameters are learned using Bayesian inference both under the assumption of Gaussian priors and by using a Markov chain Monte Carlo simulation method to perform inference in a more directed manner. We compare all established kernel formulations with each other in numerical simulations, showing that the proposed kernel model is capable of properly representing the complexities of the ATF.

Citations: 0

Physics-informed neural network for volumetric sound field reconstruction of speech signals

IF 2.4, Region 3 (Computer Science)

Eurasip Journal on Audio Speech and Music Processing Pub Date : 2024-09-09 DOI: 10.1186/s13636-024-00366-2

Marco Olivieri, Xenofon Karakonstantis, Mirco Pezzoli, Fabio Antonacci, Augusto Sarti, Efren Fernandez-Grande

Abstract: Recent developments in acoustic signal processing have seen the integration of deep learning methodologies, alongside the continued prominence of classical wave expansion-based approaches, particularly in sound field reconstruction. Physics-informed neural networks (PINNs) have emerged as a novel framework, bridging the gap between data-driven and model-based techniques for addressing physical phenomena governed by partial differential equations. This paper introduces a PINN-based approach for the recovery of arbitrary volumetric acoustic fields. The network incorporates the wave equation to impose a regularization on signal reconstruction in the time domain. This methodology enables the network to learn the physical law of sound propagation and allows for the complete characterization of the sound field based on a limited set of observations. The proposed method’s efficacy is validated through experiments involving speech signals in a real-world environment, considering varying numbers of available measurements. Moreover, a comparative analysis is undertaken against state-of-the-art frequency domain and time domain reconstruction methods from existing literature, highlighting the increased accuracy across the various measurement configurations.

Citations: 0

Optimal sensor placement for the spatial reconstruction of sound fields

IF 2.4, Region 3 (Computer Science)

Eurasip Journal on Audio Speech and Music Processing Pub Date : 2024-08-17 DOI: 10.1186/s13636-024-00364-4

Samuel A. Verburg, Filip Elvander, Toon van Waterschoot, Efren Fernandez-Grande

Abstract: The estimation of sound fields over space is of interest in sound field control and analysis, spatial audio, room acoustics and virtual reality. Sound fields can be estimated from a number of measurements distributed over space, yet this remains a challenging problem due to the large experimental effort required. In this work we investigate sensor distributions that are optimal to estimate sound fields. Such optimization is valuable as it can greatly reduce the number of measurements required. The sensor positions are optimized with respect to the parameters describing a sound field, or the pressure reconstructed at the area of interest, by finding the positions that minimize the Bayesian Cramér-Rao bound (BCRB). The optimized distributions are investigated in a numerical study as well as with measured room impulse responses. We observe a reduction in the number of measurements of approximately 50% when the sensor positions are optimized for reconstructing the sound field when compared with random distributions. The results indicate that optimizing the sensor positions is also valuable when the vector of parameters is sparse, especially compared with random sensor distributions, which are often adopted in sparse array processing in acoustics.

Citations: 0

Recognition of target domain Japanese speech using language model replacement

IF 2.4, Region 3 (Computer Science)

Eurasip Journal on Audio Speech and Music Processing Pub Date : 2024-07-20 DOI: 10.1186/s13636-024-00360-8

Daiki Mori, Kengo Ohta, Ryota Nishimura, Atsunori Ogawa, Norihide Kitaoka

Abstract: End-to-end (E2E) automatic speech recognition (ASR) models, which consist of deep learning models, are able to perform ASR tasks using a single neural network. These models should be trained using a large amount of data; however, collecting speech data which matches the targeted speech domain can be difficult, so speech data is often used that is not an exact match to the target domain, resulting in lower performance. In comparison to speech data, in-domain text data is much easier to obtain. Thus, traditional ASR systems use separately trained language models and HMM-based acoustic models. However, it is difficult to separate language information from an E2E ASR model because the model learns both acoustic and language information in an integrated manner, making it very difficult to create E2E ASR models for specialized target domains that can achieve sufficient recognition performance at a reasonable cost. In this paper, we propose a method of replacing the language information within pre-trained E2E ASR models in order to achieve adaptation to a target domain. This is achieved by deleting the “implicit” language information contained within the ASR model by subtracting, in the logarithmic domain, the source-domain language model trained with a transcription of the ASR’s training data. We then integrate a target domain language model through addition in the logarithmic domain. This subtraction and addition to replace the language model is based on Bayes’ theorem. In our experiment, we first used two datasets of the Corpus of Spontaneous Japanese (CSJ) to evaluate the effectiveness of our method. We then evaluated our method using the Japanese Newspaper Article Speech (JNAS) and CSJ corpora, which contain audio data from the read speech and spontaneous speech domain, respectively, to test the effectiveness of our proposed method at bridging the gap between these two language domains. Our results show that our proposed language model replacement method achieved better ASR performance than both non-adapted (baseline) ASR models and ASR models adapted using the conventional Shallow Fusion method.
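
The log-domain subtraction and addition described in the abstract can be sketched as a hypothesis rescoring step. By Bayes' theorem the E2E posterior is proportional to an acoustic likelihood times the source-domain prior, so dividing by the source-domain language model and multiplying by the target-domain one swaps the prior. The weights `src_weight` and `tgt_weight`, the function names, and the probabilities in the example are assumptions for illustration; the abstract does not specify them.

```python
import math


def replace_language_model(log_p_e2e, log_p_src_lm, log_p_tgt_lm,
                           src_weight=1.0, tgt_weight=1.0):
    """Score one hypothesis: remove the source-domain LM and add the target-domain LM.

    All inputs are log-probabilities of the same hypothesis; the weights are
    hypothetical tuning parameters, not values taken from the paper.
    """
    return log_p_e2e - src_weight * log_p_src_lm + tgt_weight * log_p_tgt_lm


def rescore(hypotheses):
    """Return the text of the best hypothesis.

    Each hypothesis is a tuple (text, log_p_e2e, log_p_src_lm, log_p_tgt_lm).
    """
    best = max(hypotheses, key=lambda h: replace_language_model(h[1], h[2], h[3]))
    return best[0]


# Made-up example: the E2E model prefers "a", but the target-domain LM favors "b".
candidates = [
    ("a", math.log(0.6), math.log(0.5), math.log(0.1)),
    ("b", math.log(0.4), math.log(0.2), math.log(0.3)),
]
print(rescore(candidates))
```

With unit weights, hypothesis "a" scores log(0.6 / 0.5 x 0.1) = log 0.12 and hypothesis "b" scores log(0.4 / 0.2 x 0.3) = log 0.6, so the replacement flips the ranking toward the target domain.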

Citations: 0

The whole is greater than the sum of its parts: improving music source separation by bridging networks

IF 2.4, Region 3 (Computer Science)

Eurasip Journal on Audio Speech and Music Processing Pub Date : 2024-07-19 DOI: 10.1186/s13636-024-00354-6

Ryosuke Sawata, Naoya Takahashi, Stefan Uhlich, Shusuke Takahashi, Yuki Mitsufuji

Abstract: This paper presents the crossing scheme (X-scheme) for improving the performance of deep neural network (DNN)-based music source separation (MSS) with almost no increase in computational cost. It consists of three components: (i) multi-domain loss (MDL), (ii) bridging operation, which couples the individual instrument networks, and (iii) combination loss (CL). MDL makes it possible to take advantage of both the frequency- and time-domain representations of audio signals. We modify the target network, i.e., the network architecture of the original DNN-based MSS, by adding bridging paths for each output instrument to share their information. MDL is then applied to the combinations of the output sources as well as each independent source; hence, we called it CL. MDL and CL can easily be applied to many DNN-based separation methods as they are merely loss functions that are only used during training and do not affect the inference step. Bridging operation does not increase the number of learnable parameters in the network. Experimental results showed the validity of Open-Unmix (UMX), densely connected dilated DenseNet (D3Net) and convolutional time-domain audio separation network (Conv-TasNet) extended with our X-scheme, respectively called X-UMX, X-D3Net and X-Conv-TasNet, by comparing them with their original versions. We also verified the effectiveness of X-scheme in a large-scale data regime, showing its generality with respect to data size. X-UMX Large (X-UMXL), which was trained on large-scale internal data and used in our experiments, is newly available at https://github.com/asteroid-team/asteroid/tree/master/egs/musdb18/X-UMX.

Citations: 0

Exploring task-diverse meta-learning on Tibetan multi-dialect speech recognition

IF 2.4, Region 3 (Computer Science)

Eurasip Journal on Audio Speech and Music Processing Pub Date : 2024-07-17 DOI: 10.1186/s13636-024-00361-7

Yigang Liu, Yue Zhao, Xiaona Xu, Liang Xu, Xubei Zhang, Qiang Ji

Citations: 0