{"title":"WMCodec: End-to-End Neural Speech Codec with Deep Watermarking for Authenticity Verification","authors":"Junzuo Zhou, Jiangyan Yi, Yong Ren, Jianhua Tao, Tao Wang, Chu Yuan Zhang","doi":"arxiv-2409.12121","DOIUrl":"https://doi.org/arxiv-2409.12121","url":null,"abstract":"Recent advances in speech spoofing necessitate stronger verification mechanisms in neural speech codecs to ensure authenticity. Current methods embed numerical watermarks before compression and extract them from reconstructed speech for verification, but face limitations such as separate training processes for the watermark and codec, and insufficient cross-modal information integration, leading to reduced watermark imperceptibility, extraction accuracy, and capacity. To address these issues, we propose WMCodec, the first neural speech codec to jointly train compression-reconstruction and watermark embedding-extraction in an end-to-end manner, optimizing both imperceptibility and extractability of the watermark. Furthermore, we design an iterative Attention Imprint Unit (AIU) for deeper feature integration of watermark and speech, reducing the impact of quantization noise on the watermark. Experimental results show that WMCodec outperforms AudioSeal with Encodec in most quality metrics for watermark imperceptibility and consistently exceeds both AudioSeal with Encodec and reinforced TraceableSpeech in watermark extraction accuracy. At a bandwidth of 6 kbps with a watermark capacity of 16 bps, WMCodec maintains over 99% extraction accuracy under common attacks, demonstrating strong robustness.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DETECLAP: Enhancing Audio-Visual Representation Learning with Object Information","authors":"Shota Nakada, Taichi Nishimura, Hokuto Munakata, Masayoshi Kondo, Tatsuya Komatsu","doi":"arxiv-2409.11729","DOIUrl":"https://doi.org/arxiv-2409.11729","url":null,"abstract":"Current audio-visual representation learning can capture rough object categories (e.g., ``animals'' and ``instruments''), but it lacks the ability to recognize fine-grained details, such as specific categories like ``dogs'' and ``flutes'' within animals and instruments. To address this issue, we introduce DETECLAP, a method to enhance audio-visual representation learning with object information. Our key idea is to introduce an audio-visual label prediction loss to the existing Contrastive Audio-Visual Masked AutoEncoder to enhance its object awareness. To avoid costly manual annotations, we prepare object labels from both audio and visual inputs using state-of-the-art language-audio models and object detectors. We evaluate the method on audio-visual retrieval and classification using the VGGSound and AudioSet20K datasets. Our method achieves improvements in recall@10 of +1.5% and +1.2% for audio-to-visual and visual-to-audio retrieval, respectively, and an improvement in accuracy of +0.6% for audio-visual classification.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spin Detection Using Racket Bounce Sounds in Table Tennis","authors":"Thomas Gossard, Julian Schmalzl, Andreas Ziegler, Andreas Zell","doi":"arxiv-2409.11760","DOIUrl":"https://doi.org/arxiv-2409.11760","url":null,"abstract":"While table tennis players primarily rely on visual cues, sound provides valuable information. The sound generated when the ball strikes the racket can assist in predicting the ball's trajectory, especially in determining the spin. While professional players can distinguish spin through these auditory cues, they often go unnoticed by untrained players. In this paper, we demonstrate that different rackets produce distinct sounds, which can be used to identify the racket type. In addition, we show that the sound generated by the racket can indicate whether or not spin was applied to the ball. To achieve this, we created a comprehensive dataset featuring bounce sounds from 10 racket configurations, each applying various spins to the ball. To achieve millisecond-level temporal accuracy, we first detect high-frequency peaks that may correspond to table tennis ball bounces. We then refine these results using a CNN-based classifier that accurately predicts both the type of racket used and whether spin was applied.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Conformal Prediction for Manifold-based Source Localization with Gaussian Processes","authors":"Vadim Rozenfeld, Bracha Laufer Goldshtein","doi":"arxiv-2409.11804","DOIUrl":"https://doi.org/arxiv-2409.11804","url":null,"abstract":"We tackle the challenge of uncertainty quantification in the localization of a sound source within adverse acoustic environments. Estimating the position of the source is influenced by various factors such as noise and reverberation, leading to significant uncertainty. Quantifying this uncertainty is essential, particularly when localization outcomes impact critical decision-making processes, such as in robot audition, where the accuracy of location estimates directly influences subsequent actions. Despite this, many localization methods typically offer point estimates without quantifying the estimation uncertainty. To address this, we employ conformal prediction (CP), a framework that delivers statistically valid prediction intervals with finite-sample guarantees, independent of the data distribution. However, commonly used Inductive CP (ICP) methods require a substantial amount of labeled data, which can be difficult to obtain in the localization setting. To mitigate this limitation, we incorporate a manifold-based localization method using Gaussian process regression (GPR), with an efficient Transductive CP (TCP) technique specifically designed for GPR. We demonstrate that our method generates statistically valid uncertainty intervals across different acoustic conditions.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech","authors":"Xin Qi, Ruibo Fu, Zhengqi Wen, Tao Wang, Chunyu Qiang, Jianhua Tao, Chenxing Li, Yi Lu, Shuchen Shi, Zhiyong Wang, Xiaopeng Wang, Yuankun Xie, Yukun Liu, Xuefei Liu, Guanjun Li","doi":"arxiv-2409.11835","DOIUrl":"https://doi.org/arxiv-2409.11835","url":null,"abstract":"In recent years, speech diffusion models have advanced rapidly. Alongside the widely used U-Net architecture, transformer-based models such as the Diffusion Transformer (DiT) have also gained attention. However, current DiT speech models treat Mel spectrograms as general images, which overlooks the specific acoustic properties of speech. To address these limitations, we propose a method called Directional Patch Interaction for Text-to-Speech (DPI-TTS), which builds on DiT and achieves fast training without compromising accuracy. Notably, DPI-TTS employs a low-to-high frequency, frame-by-frame progressive inference approach that aligns more closely with acoustic properties, enhancing the naturalness of the generated speech. Additionally, we introduce a fine-grained style temporal modeling method that further improves speaker style similarity. Experimental results demonstrate that our method nearly doubles the training speed and significantly outperforms the baseline models.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dense-TSNet: Dense Connected Two-Stage Structure for Ultra-Lightweight Speech Enhancement","authors":"Zizhen Lin, Yuanle Li, Junyu Wang, Ruili Li","doi":"arxiv-2409.11725","DOIUrl":"https://doi.org/arxiv-2409.11725","url":null,"abstract":"Speech enhancement aims to improve speech quality and intelligibility in noisy environments. Recent advancements have concentrated on deep neural networks, particularly employing the Two-Stage (TS) architecture to enhance feature extraction. However, the complexity and size of these models remain significant, which limits their applicability in resource-constrained scenarios. Designing models suitable for edge devices presents its own set of challenges. Narrow lightweight models often encounter performance bottlenecks due to uneven loss landscapes. Additionally, advanced operators such as Transformers or Mamba may lack the practical adaptability and efficiency that convolutional neural networks (CNNs) offer in real-world deployments. To address these challenges, we propose Dense-TSNet, an innovative ultra-lightweight speech enhancement network. Our approach employs a novel Dense Two-Stage (Dense-TS) architecture, which, compared to the classic Two-Stage architecture, ensures more robust refinement of the objective function in the later training stages. This leads to improved final performance, addressing the early convergence limitations of the baseline model. We also introduce the Multi-View Gaze Block (MVGB), which enhances feature extraction by incorporating global, channel, and local perspectives through convolutional neural networks (CNNs). Furthermore, we discuss how the choice of loss function impacts perceptual quality. Dense-TSNet demonstrates promising performance with a compact model size of around 14K parameters, making it particularly well-suited for deployment in resource-constrained environments.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Insights into the Incorporation of Signal Information in Binaural Signal Matching with Wearable Microphone Arrays","authors":"Ami Berger, Vladimir Tourbabin, Jacob Donley, Zamir Ben-Hur, Boaz Rafaely","doi":"arxiv-2409.11731","DOIUrl":"https://doi.org/arxiv-2409.11731","url":null,"abstract":"The increasing popularity of spatial audio in applications such as teleconferencing, entertainment, and virtual reality has led to the recent developments of binaural reproduction methods. However, only a few of these methods are well-suited for wearable and mobile arrays, which typically consist of a small number of microphones. One such method is binaural signal matching (BSM), which has been shown to produce high-quality binaural signals for wearable arrays. However, BSM may be suboptimal in cases of high direct-to-reverberant ratio (DRR), as it is based on the diffuse sound field assumption. To overcome this limitation, previous studies incorporated sound-field models other than diffuse. However, this approach was not studied comprehensively. This paper extensively investigates two BSM-based methods designed for high DRR scenarios. The methods incorporate a sound field model composed of direct and reverberant components. The methods are investigated both mathematically and using simulations, and finally validated by a listening test. The results show that the proposed methods can significantly improve the performance of BSM, in particular in the direction of the source, while presenting only a negligible degradation in other directions. Furthermore, when source direction estimation is inaccurate, the performance of these methods degrades to match that of BSM, demonstrating a desirable robustness quality.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mixture of Experts Fusion for Fake Audio Detection Using Frozen wav2vec 2.0","authors":"Zhiyong Wang, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Xiaopeng Wang, Yuankun Xie, Xin Qi, Shuchen Shi, Yi Lu, Yukun Liu, Chenxing Li, Xuefei Liu, Guanjun Li","doi":"arxiv-2409.11909","DOIUrl":"https://doi.org/arxiv-2409.11909","url":null,"abstract":"Speech synthesis technology has posed a serious threat to speaker verification systems. Currently, the most effective fake audio detection methods utilize pretrained models, and integrating features from various layers of the pretrained model further enhances detection performance. However, most previously proposed fusion methods require fine-tuning the pretrained models, resulting in excessively long training times and hindering model iteration when facing new speech synthesis technology. To address this issue, this paper proposes a feature fusion method based on the Mixture of Experts, which extracts and integrates features relevant to fake audio detection from layer features, guided by a gating network based on the last-layer feature, while freezing the pretrained model. Experiments conducted on the ASVspoof2019 and ASVspoof2021 datasets demonstrate that the proposed method achieves competitive performance compared to methods requiring fine-tuning.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ASR Benchmarking: Need for a More Representative Conversational Dataset","authors":"Gaurav Maheshwari, Dmitry Ivanov, Théo Johannet, Kevin El Haddad","doi":"arxiv-2409.12042","DOIUrl":"https://doi.org/arxiv-2409.12042","url":null,"abstract":"Automatic Speech Recognition (ASR) systems have achieved remarkable performance on widely used benchmarks such as LibriSpeech and Fleurs. However, these benchmarks do not adequately reflect the complexities of real-world conversational environments, where speech is often unstructured and contains disfluencies such as pauses, interruptions, and diverse accents. In this study, we introduce a multilingual conversational dataset, derived from TalkBank, consisting of unstructured phone conversations between adults. Our results show a significant performance drop across various state-of-the-art ASR models when tested in conversational settings. Furthermore, we observe a correlation between Word Error Rate and the presence of speech disfluencies, highlighting the critical need for more realistic, conversational ASR benchmarks.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speaking from Coarse to Fine: Improving Neural Codec Language Model via Multi-Scale Speech Coding and Generation","authors":"Haohan Guo, Fenglong Xie, Dongchao Yang, Xixin Wu, Helen Meng","doi":"arxiv-2409.11630","DOIUrl":"https://doi.org/arxiv-2409.11630","url":null,"abstract":"The neural codec language model (CLM) has demonstrated remarkable performance in text-to-speech (TTS) synthesis. However, troubled by ``recency bias'', the CLM lacks sufficient attention to coarse-grained information at a higher temporal scale, often producing unnatural or even unintelligible speech. This work proposes CoFi-Speech, a coarse-to-fine CLM-TTS approach, employing multi-scale speech coding and generation to address this issue. We train a multi-scale neural codec, CoFi-Codec, to encode speech into a multi-scale discrete representation, comprising multiple token sequences with different time resolutions. Then, we propose CoFi-LM, which can generate this representation in two modes: the single-LM-based chain-of-scale generation and the multiple-LM-based stack-of-scale generation. In experiments, CoFi-Speech significantly outperforms single-scale baseline systems on naturalness and speaker similarity in zero-shot TTS. The analysis of multi-scale coding demonstrates the effectiveness of CoFi-Codec in learning multi-scale discrete speech representations while maintaining high-quality speech reconstruction. The coarse-to-fine multi-scale generation, especially the stack-of-scale approach, is also validated as a crucial technique in pursuing a high-quality neural codec language model for TTS.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}