{"title":"WMCodec: End-to-End Neural Speech Codec with Deep Watermarking for Authenticity Verification","authors":"Junzuo Zhou, Jiangyan Yi, Yong Ren, Jianhua Tao, Tao Wang, Chu Yuan Zhang","doi":"arxiv-2409.12121","DOIUrl":"https://doi.org/arxiv-2409.12121","url":null,"abstract":"Recent advances in speech spoofing necessitate stronger verification mechanisms in neural speech codecs to ensure authenticity. Current methods embed numerical watermarks before compression and extract them from reconstructed speech for verification, but face limitations such as separate training processes for the watermark and codec, and insufficient cross-modal information integration, leading to reduced watermark imperceptibility, extraction accuracy, and capacity. To address these issues, we propose WMCodec, the first neural speech codec to jointly train compression-reconstruction and watermark embedding-extraction in an end-to-end manner, optimizing both imperceptibility and extractability of the watermark. Furthermore, we design an iterative Attention Imprint Unit (AIU) for deeper feature integration of watermark and speech, reducing the impact of quantization noise on the watermark. Experimental results show that WMCodec outperforms AudioSeal with Encodec in most quality metrics for watermark imperceptibility and consistently exceeds both AudioSeal with Encodec and reinforced TraceableSpeech in watermark extraction accuracy. At a bandwidth of 6 kbps with a watermark capacity of 16 bps, WMCodec maintains over 99% extraction accuracy under common attacks, demonstrating strong robustness.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DETECLAP: Enhancing Audio-Visual Representation Learning with Object Information","authors":"Shota Nakada, Taichi Nishimura, Hokuto Munakata, Masayoshi Kondo, Tatsuya Komatsu","doi":"arxiv-2409.11729","DOIUrl":"https://doi.org/arxiv-2409.11729","url":null,"abstract":"Current audio-visual representation learning can capture rough object categories (e.g., ``animals'' and ``instruments''), but it lacks the ability to recognize fine-grained details, such as specific categories like ``dogs'' and ``flutes'' within animals and instruments. To address this issue, we introduce DETECLAP, a method to enhance audio-visual representation learning with object information. Our key idea is to introduce an audio-visual label prediction loss to the existing Contrastive Audio-Visual Masked AutoEncoder to enhance its object awareness. To avoid costly manual annotations, we prepare object labels from both audio and visual inputs using state-of-the-art language-audio models and object detectors. We evaluate the method on audio-visual retrieval and classification using the VGGSound and AudioSet20K datasets. Our method achieves improvements in recall@10 of +1.5% and +1.2% for audio-to-visual and visual-to-audio retrieval, respectively, and an improvement in accuracy of +0.6% for audio-visual classification.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spin Detection Using Racket Bounce Sounds in Table Tennis","authors":"Thomas Gossard, Julian Schmalzl, Andreas Ziegler, Andreas Zell","doi":"arxiv-2409.11760","DOIUrl":"https://doi.org/arxiv-2409.11760","url":null,"abstract":"While table tennis players primarily rely on visual cues, sound provides valuable information. The sound generated when the ball strikes the racket can assist in predicting the ball's trajectory, especially in determining the spin. While professional players can distinguish spin through these auditory cues, they often go unnoticed by untrained players. In this paper, we demonstrate that different rackets produce distinct sounds, which can be used to identify the racket type. In addition, we show that the sound generated by the racket can indicate whether or not spin was applied to the ball. To achieve this, we created a comprehensive dataset featuring bounce sounds from 10 racket configurations, each applying various spins to the ball. To achieve millisecond-level temporal accuracy, we first detect high-frequency peaks that may correspond to table tennis ball bounces. We then refine these results using a CNN-based classifier that accurately predicts both the type of racket used and whether spin was applied.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Conformal Prediction for Manifold-based Source Localization with Gaussian Processes","authors":"Vadim Rozenfeld, Bracha Laufer Goldshtein","doi":"arxiv-2409.11804","DOIUrl":"https://doi.org/arxiv-2409.11804","url":null,"abstract":"We tackle the challenge of uncertainty quantification in the localization of a sound source within adverse acoustic environments. Estimating the position of the source is influenced by various factors such as noise and reverberation, leading to significant uncertainty. Quantifying this uncertainty is essential, particularly when localization outcomes impact critical decision-making processes, such as in robot audition, where the accuracy of location estimates directly influences subsequent actions. Despite this, many localization methods typically offer point estimates without quantifying the estimation uncertainty. To address this, we employ conformal prediction (CP), a framework that delivers statistically valid prediction intervals with finite-sample guarantees, independent of the data distribution. However, commonly used Inductive CP (ICP) methods require a substantial amount of labeled data, which can be difficult to obtain in the localization setting. To mitigate this limitation, we incorporate a manifold-based localization method using Gaussian process regression (GPR), with an efficient Transductive CP (TCP) technique specifically designed for GPR. We demonstrate that our method generates statistically valid uncertainty intervals across different acoustic conditions.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech","authors":"Xin Qi, Ruibo Fu, Zhengqi Wen, Tao Wang, Chunyu Qiang, Jianhua Tao, Chenxing Li, Yi Lu, Shuchen Shi, Zhiyong Wang, Xiaopeng Wang, Yuankun Xie, Yukun Liu, Xuefei Liu, Guanjun Li","doi":"arxiv-2409.11835","DOIUrl":"https://doi.org/arxiv-2409.11835","url":null,"abstract":"In recent years, speech diffusion models have advanced rapidly. Alongside the widely used U-Net architecture, transformer-based models such as the Diffusion Transformer (DiT) have also gained attention. However, current DiT speech models treat Mel spectrograms as general images, which overlooks the specific acoustic properties of speech. To address these limitations, we propose a method called Directional Patch Interaction for Text-to-Speech (DPI-TTS), which builds on DiT and achieves fast training without compromising accuracy. Notably, DPI-TTS employs a low-to-high frequency, frame-by-frame progressive inference approach that aligns more closely with acoustic properties, enhancing the naturalness of the generated speech. Additionally, we introduce a fine-grained style temporal modeling method that further improves speaker style similarity. Experimental results demonstrate that our method nearly doubles the training speed and significantly outperforms the baseline models.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dense-TSNet: Dense Connected Two-Stage Structure for Ultra-Lightweight Speech Enhancement","authors":"Zizhen Lin, Yuanle Li, Junyu Wang, Ruili Li","doi":"arxiv-2409.11725","DOIUrl":"https://doi.org/arxiv-2409.11725","url":null,"abstract":"Speech enhancement aims to improve speech quality and intelligibility in noisy environments. Recent advancements have concentrated on deep neural networks, particularly employing the Two-Stage (TS) architecture to enhance feature extraction. However, the complexity and size of these models remain significant, which limits their applicability in resource-constrained scenarios. Designing models suitable for edge devices presents its own set of challenges. Narrow lightweight models often encounter performance bottlenecks due to uneven loss landscapes. Additionally, advanced operators such as Transformers or Mamba may lack the practical adaptability and efficiency that convolutional neural networks (CNNs) offer in real-world deployments. To address these challenges, we propose Dense-TSNet, an innovative ultra-lightweight speech enhancement network. Our approach employs a novel Dense Two-Stage (Dense-TS) architecture, which, compared to the classic Two-Stage architecture, ensures more robust refinement of the objective function in the later training stages. This leads to improved final performance, addressing the early convergence limitations of the baseline model. We also introduce the Multi-View Gaze Block (MVGB), which enhances feature extraction by incorporating global, channel, and local perspectives through convolutional neural networks (CNNs). Furthermore, we discuss how the choice of loss function impacts perceptual quality. Dense-TSNet demonstrates promising performance with a compact model size of around 14K parameters, making it particularly well-suited for deployment in resource-constrained environments.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Insights into the Incorporation of Signal Information in Binaural Signal Matching with Wearable Microphone Arrays","authors":"Ami Berger, Vladimir Tourbabin, Jacob Donley, Zamir Ben-Hur, Boaz Rafaely","doi":"arxiv-2409.11731","DOIUrl":"https://doi.org/arxiv-2409.11731","url":null,"abstract":"The increasing popularity of spatial audio in applications such as teleconferencing, entertainment, and virtual reality has led to the recent developments of binaural reproduction methods. However, only a few of these methods are well-suited for wearable and mobile arrays, which typically consist of a small number of microphones. One such method is binaural signal matching (BSM), which has been shown to produce high-quality binaural signals for wearable arrays. However, BSM may be suboptimal in cases of high direct-to-reverberant ratio (DRR), as it is based on the diffuse sound field assumption. To overcome this limitation, previous studies incorporated sound-field models other than diffuse. However, this approach was not studied comprehensively. This paper extensively investigates two BSM-based methods designed for high DRR scenarios. The methods incorporate a sound field model composed of direct and reverberant components. The methods are investigated both mathematically and using simulations, and finally validated by a listening test. The results show that the proposed methods can significantly improve the performance of BSM, in particular in the direction of the source, while presenting only a negligible degradation in other directions. Furthermore, when source direction estimation is inaccurate, the performance of these methods degrades to match that of BSM, demonstrating a desirable robustness quality.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mixture of Experts Fusion for Fake Audio Detection Using Frozen wav2vec 2.0","authors":"Zhiyong Wang, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Xiaopeng Wang, Yuankun Xie, Xin Qi, Shuchen Shi, Yi Lu, Yukun Liu, Chenxing Li, Xuefei Liu, Guanjun Li","doi":"arxiv-2409.11909","DOIUrl":"https://doi.org/arxiv-2409.11909","url":null,"abstract":"Speech synthesis technology has posed a serious threat to speaker verification systems. Currently, the most effective fake audio detection methods utilize pretrained models, and integrating features from various layers of the pretrained model further enhances detection performance. However, most previously proposed fusion methods require fine-tuning the pretrained models, resulting in excessively long training times and hindering model iteration when facing new speech synthesis technology. To address this issue, this paper proposes a feature fusion method based on the Mixture of Experts, which extracts and integrates features relevant to fake audio detection from layer features, guided by a gating network based on the last-layer feature, while freezing the pretrained model. Experiments conducted on the ASVspoof2019 and ASVspoof2021 datasets demonstrate that the proposed method achieves competitive performance compared to methods requiring fine-tuning.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ASR Benchmarking: Need for a More Representative Conversational Dataset","authors":"Gaurav Maheshwari, Dmitry Ivanov, Théo Johannet, Kevin El Haddad","doi":"arxiv-2409.12042","DOIUrl":"https://doi.org/arxiv-2409.12042","url":null,"abstract":"Automatic Speech Recognition (ASR) systems have achieved remarkable performance on widely used benchmarks such as LibriSpeech and Fleurs. However, these benchmarks do not adequately reflect the complexities of real-world conversational environments, where speech is often unstructured and contains disfluencies such as pauses, interruptions, and diverse accents. In this study, we introduce a multilingual conversational dataset, derived from TalkBank, consisting of unstructured phone conversations between adults. Our results show a significant performance drop across various state-of-the-art ASR models when tested in conversational settings. Furthermore, we observe a correlation between Word Error Rate and the presence of speech disfluencies, highlighting the critical need for more realistic, conversational ASR benchmarks.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speaking from Coarse to Fine: Improving Neural Codec Language Model via Multi-Scale Speech Coding and Generation","authors":"Haohan Guo, Fenglong Xie, Dongchao Yang, Xixin Wu, Helen Meng","doi":"arxiv-2409.11630","DOIUrl":"https://doi.org/arxiv-2409.11630","url":null,"abstract":"The neural codec language model (CLM) has demonstrated remarkable performance in text-to-speech (TTS) synthesis. However, troubled by ``recency bias'', the CLM lacks sufficient attention to coarse-grained information at a higher temporal scale, often producing unnatural or even unintelligible speech. This work proposes CoFi-Speech, a coarse-to-fine CLM-TTS approach, employing multi-scale speech coding and generation to address this issue. We train a multi-scale neural codec, CoFi-Codec, to encode speech into a multi-scale discrete representation, comprising multiple token sequences with different time resolutions. Then, we propose CoFi-LM, which can generate this representation in two modes: the single-LM-based chain-of-scale generation and the multiple-LM-based stack-of-scale generation. In experiments, CoFi-Speech significantly outperforms single-scale baseline systems on naturalness and speaker similarity in zero-shot TTS. The analysis of multi-scale coding demonstrates the effectiveness of CoFi-Codec in learning multi-scale discrete speech representations while maintaining high-quality speech reconstruction. The coarse-to-fine multi-scale generation, especially the stack-of-scale approach, is also validated as a crucial technique in pursuing a high-quality neural codec language model for TTS.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}