2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA): Latest Papers, Page 6

Crowdsourcing Strong Labels for Sound Event Detection

2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) Pub Date : 2021-07-26 DOI: 10.1109/WASPAA52581.2021.9632761

Irene Martín-Morató, Manu Harju, A. Mesaros

Citations: 5

Joint Direction and Proximity Classification of Overlapping Sound Events from Binaural Audio

2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) Pub Date : 2021-07-26 DOI: 10.1109/WASPAA52581.2021.9632775

D. Krause, A. Politis, A. Mesaros

Abstract: Sound source proximity and distance estimation are of great interest in many practical applications, since they provide significant information for acoustic scene analysis. As both tasks share complementary qualities, ensuring efficient interaction between these two is crucial for a complete picture of an aural environment. In this paper, we aim to investigate several ways of performing joint proximity and direction estimation from binaural recordings, both defined as coarse classification problems based on Deep Neural Networks (DNNs). Considering the limitations of binaural audio, we propose two methods of splitting the sphere into angular areas in order to obtain a set of directional classes. For each method we study different model types to acquire information about the direction-of-arrival (DoA). Finally, we propose various ways of combining the proximity and direction estimation problems into a joint task providing temporal information about the onsets and offsets of the appearing sources. Experiments are performed for a synthetic reverberant binaural dataset consisting of up to two overlapping sound events.

Citations: 2

Saladnet: Self-Attentive Multisource Localization in the Ambisonics Domain

2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) Pub Date : 2021-07-23 DOI: 10.1109/WASPAA52581.2021.9632737

Pierre-Amaury Grumiaux, Srđan Kitić, Prerak Srivastava, Laurent Girin, Alexandre Guérin

Citations: 4

Harp-Net: Hyper-Autoencoded Reconstruction Propagation for Scalable Neural Audio Coding

2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) Pub Date : 2021-07-22 DOI: 10.1109/WASPAA52581.2021.9632723

Darius Petermann, Seungkwon Beack, Minje Kim

Citations: 11

Controlling the Remixing of Separated Dialogue with a Non-Intrusive Quality Estimate

2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) Pub Date : 2021-07-21 DOI: 10.1109/WASPAA52581.2021.9632756

Matteo Torcoli, Jouni Paulus, T. Kastner, C. Uhle

Abstract: Remixing separated audio sources trades off interferer attenuation against the amount of audible deteriorations. This paper proposes a non-intrusive audio quality estimation method for controlling this trade-off in a signal-adaptive manner. The recently proposed 2f-model is adopted as the underlying quality measure, since it has been shown to correlate strongly with basic audio quality in source separation. An alternative operation mode of the measure is proposed, more appropriate when considering material with long inactive periods of the target source. The 2f-model requires the reference target source as an input, but this is not available in many applications. Deep neural networks (DNNs) are trained to estimate the 2f-model intrusively using the reference target (iDNN2f), non-intrusively using the input mix as reference (nDNN2f), and reference-free using only the separated output signal (rDNN2f). It is shown that iDNN2f achieves very strong correlation with the original measure on the test data (Pearson $\rho = 0.99$), while performance decreases for nDNN2f ($\rho \geq 0.91$) and rDNN2f ($\rho \geq 0.82$). The non-intrusive estimate nDNN2f is mapped to select item-dependent remixing gains with the aim of maximizing the interferer attenuation under a constraint on the minimum quality of the remixed output (e.g., audible but not annoying deteriorations). A listening test shows that this is successfully achieved even with very different selected gains (up to 23 dB difference).

Citations: 7
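The gain selection described in the abstract above (maximize interferer attenuation subject to a minimum estimated quality) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the remix rule (separated dialogue plus attenuated background), the candidate attenuation grid, and especially `toy_quality`, which is a placeholder and not the 2f-model or nDNN2f.

```python
def remix(dialogue_est, background_est, attenuation_db):
    """Mix the separated dialogue with the background attenuated by attenuation_db (assumed remix rule)."""
    gain = 10.0 ** (-attenuation_db / 20.0)
    return [d + gain * b for d, b in zip(dialogue_est, background_est)]


def select_attenuation(dialogue_est, background_est, mix, estimate_quality,
                       min_quality, candidates_db):
    """Return the largest candidate attenuation whose estimated quality is at
    least min_quality; fall back to the smallest candidate if none qualifies."""
    chosen = min(candidates_db)
    for attenuation_db in sorted(candidates_db):
        remixed = remix(dialogue_est, background_est, attenuation_db)
        # estimate_quality is a hypothetical callable taking (output, input mix).
        if estimate_quality(remixed, mix) >= min_quality:
            chosen = attenuation_db
    return chosen


def _energy(signal):
    """Sum of squared samples."""
    return sum(v * v for v in signal)


def toy_quality(remixed, mix):
    """Placeholder score in [0, 100]. NOT a perceptual measure: it simply
    drops as more of the input mix energy is removed."""
    return 100.0 * min(1.0, _energy(remixed) / _energy(mix))


# Tiny hand-made signals, purely illustrative.
dialogue = [1.0, 0.0, 1.0, 0.0]
background = [0.0, 1.0, 0.0, 1.0]
mix = [d + b for d, b in zip(dialogue, background)]

chosen_db = select_attenuation(dialogue, background, mix, toy_quality,
                               min_quality=60.0,
                               candidates_db=[0, 3, 6, 9, 12])
print(chosen_db)
```

The loop checks every candidate rather than stopping at the first failure, so it does not rely on the quality estimate decreasing monotonically with attenuation. In a real system the placeholder would be replaced by a trained non-intrusive estimator evaluated per item.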

Objective Metrics to Evaluate Residual-Echo Suppression During Double-Talk

2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) Pub Date : 2021-07-15 DOI: 10.1109/WASPAA52581.2021.9632787

Amir Ivry, I. Cohen, B. Berdugo

Abstract: Human subjective evaluation is optimal to assess speech quality for human perception. The recently introduced deep noise suppression mean opinion score (DNSMOS) metric was shown to estimate human ratings with great accuracy. The signal-to-distortion ratio (SDR) metric is widely used to evaluate residual-echo suppression (RES) systems by estimating speech quality during double-talk. However, since the SDR is affected by both speech distortion and residual-echo presence, it does not correlate well with human ratings according to the DNSMOS. To address that, we introduce two objective metrics to separately quantify the desired-speech maintained level (DSML) and residual-echo suppression level (RESL) during double-talk. These metrics are evaluated using a deep learning-based RES-system with a tunable design parameter. Using 280 hours of real and simulated recordings, we show that the DSML and RESL correlate well with the DNSMOS with high generalization to various setups. Also, we empirically investigate the relation between tuning the RES-system design parameter and the DSML-RESL tradeoff it creates and offer a practical design scheme for dynamic system requirements.

Citations: 4

Filtered Noise Shaping for Time Domain Room Impulse Response Estimation from Reverberant Speech

2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) Pub Date : 2021-07-15 DOI: 10.1109/WASPAA52581.2021.9632680

C. Steinmetz, V. Ithapu, P. Calamia

Abstract: Deep learning approaches have emerged that aim to transform an audio signal so that it sounds as if it was recorded in the same room as a reference recording, with applications both in audio postproduction and augmented reality. In this work, we propose FiNS, a Filtered Noise Shaping network that directly estimates the time domain room impulse response (RIR) from reverberant speech. Our domain-inspired architecture features a time domain encoder and a filtered noise shaping decoder that models the RIR as a summation of decaying filtered noise signals, along with direct sound and early reflection components. Previous methods for acoustic matching utilize either large models to transform audio to match the target room or predict parameters for algorithmic reverberators. Instead, blind estimation of the RIR enables efficient and realistic transformation with a single convolution. An evaluation demonstrates our model not only synthesizes RIRs that match parameters of the target room, such as the $T_{60}$ and DRR, but also more accurately reproduces perceptual characteristics of the target room, as shown in a listening test when compared to deep learning baselines.

Citations: 22
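The signal model named in the abstract above (an RIR formed as a sum of decaying filtered noise signals plus direct-sound and early-reflection components, then applied with a single convolution) can be sketched as follows. This is only an illustration of that signal model, not the FiNS network: in the paper these quantities are estimated from reverberant speech, whereas the filter taps, gains, decay rates and early samples below are hand-picked, hypothetical values. Pure standard-library Python is used to keep the sketch self-contained.

```python
import math
import random


def fir_filter(signal, taps):
    """Causal FIR filtering; output has the same length as the input."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * signal[n - k]
        out.append(acc)
    return out


def synth_rir(length, bands, early, seed=0):
    """Sum of exponentially decaying filtered-noise bands plus an early part.

    bands: list of (fir_taps, gain, decay_per_sample), all hypothetical.
    early: samples added at the start (direct sound and early reflections).
    """
    rng = random.Random(seed)
    rir = [0.0] * length
    for taps, gain, decay in bands:
        noise = [rng.gauss(0.0, 1.0) for _ in range(length)]
        shaped = fir_filter(noise, taps)
        for n in range(length):
            rir[n] += gain * math.exp(-decay * n) * shaped[n]
    for n, value in enumerate(early[:length]):
        rir[n] += value
    return rir


def convolve(x, h):
    """Full linear convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


# Two illustrative bands: a crude low-pass and a crude high-pass filter.
bands = [([0.5, 0.5], 0.3, 0.02), ([0.5, -0.5], 0.2, 0.05)]
early = [1.0, 0.0, 0.4]
rir = synth_rir(256, bands, early)

# Applying the room to a dry signal is then one convolution.
dry = [1.0] + [0.0] * 31  # unit impulse as a stand-in for dry speech
wet = convolve(dry, rir)
print(len(rir), len(wet))
```

Because the dry signal here is a unit impulse, the start of `wet` simply reproduces the synthesized RIR; with real speech the same call would add the room's reverberation.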

Low Complexity Online Convolutional Beamforming

2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) Pub Date : 2021-07-14 DOI: 10.1109/WASPAA52581.2021.9632780

Sebastian Braun, I. Tashev

Citations: 2

Adversarial Auto-Encoding for Packet Loss Concealment

2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) Pub Date : 2021-07-07 DOI: 10.1109/WASPAA52581.2021.9632730

Santiago Pascual, J. Serrà, Jordi Pons

Citations: 15

DF-Conformer: Integrated Architecture of Conv-TasNet and Conformer Using Linear Complexity Self-Attention for Speech Enhancement

2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) Pub Date : 2021-06-30 DOI: 10.1109/WASPAA52581.2021.9632794

Yuma Koizumi, Shigeki Karita, Scott Wisdom, Hakan Erdogan, J. Hershey, Llion Jones, M. Bacchiani

Abstract: Single-channel speech enhancement (SE) is an important task in speech processing. A widely used framework combines an analysis/synthesis filterbank with a mask prediction network, such as the Conv-TasNet architecture. In such systems, the denoising performance and computational efficiency are mainly affected by the structure of the mask prediction network. In this study, we aim to improve the sequential modeling ability of Conv-TasNet architectures by integrating Conformer layers into a new mask prediction network. To make the model computationally feasible, we extend the Conformer using linear complexity attention and stacked 1-D dilated depthwise convolution layers. We trained the model on 3,396 hours of noisy speech data, and show that (i) the use of linear complexity attention avoids high computational complexity, and (ii) our model achieves higher scale-invariant signal-to-noise ratio than the improved time-dilated convolution network (TDCN++), an extended version of Conv-TasNet.

Citations: 30
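The abstract above reports results in terms of scale-invariant signal-to-noise ratio (SI-SNR). A minimal sketch of the commonly used definition follows: project the estimate onto the target, treat the remainder as noise, and take the energy ratio in dB. The mean removal and the small `eps` stabilizer are implementation choices assumed here, not details taken from the paper.

```python
import math


def si_snr(estimate, target, eps=1e-12):
    """Scale-invariant SNR in dB between two equal-length sequences."""
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")
    # Remove the mean so a DC offset does not affect the score (assumed choice).
    mean_e = sum(estimate) / len(estimate)
    mean_t = sum(target) / len(target)
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target to get the scaled target component.
    dot = sum(a * b for a, b in zip(e, t))
    target_energy = sum(b * b for b in t) + eps
    scale = dot / target_energy
    s_target = [scale * b for b in t]
    # Whatever is left after the projection counts as noise.
    e_noise = [a - s for a, s in zip(e, s_target)]
    num = sum(s * s for s in s_target)
    den = sum(n * n for n in e_noise) + eps
    return 10.0 * math.log10(num / den + eps)


# Illustrative signals: a sinusoid plus a weaker interfering sinusoid.
target = [math.sin(0.1 * n) for n in range(400)]
noise = [0.1 * math.cos(1.7 * n) for n in range(400)]
noisy = [s + v for s, v in zip(target, noise)]
rescaled = [3.0 * x for x in noisy]

print(si_snr(noisy, target), si_snr(rescaled, target))
```

Rescaling the estimate leaves the score essentially unchanged, which is the property that makes the metric suitable for comparing enhancement systems whose output levels differ.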