arXiv - EE - Audio and Speech Processing: Latest Papers

WMCodec: End-to-End Neural Speech Codec with Deep Watermarking for Authenticity Verification
arXiv - EE - Audio and Speech Processing Pub Date : 2024-09-18 DOI: arxiv-2409.12121
Junzuo Zhou, Jiangyan Yi, Yong Ren, Jianhua Tao, Tao Wang, Chu Yuan Zhang
Abstract: Recent advances in speech spoofing necessitate stronger verification mechanisms in neural speech codecs to ensure authenticity. Current methods embed numerical watermarks before compression and extract them from reconstructed speech for verification, but face limitations such as separate training processes for the watermark and codec, and insufficient cross-modal information integration, leading to reduced watermark imperceptibility, extraction accuracy, and capacity. To address these issues, we propose WMCodec, the first neural speech codec to jointly train compression-reconstruction and watermark embedding-extraction in an end-to-end manner, optimizing both imperceptibility and extractability of the watermark. Furthermore, we design an iterative Attention Imprint Unit (AIU) for deeper feature integration of watermark and speech, reducing the impact of quantization noise on the watermark. Experimental results show WMCodec outperforms AudioSeal with Encodec in most quality metrics for watermark imperceptibility and consistently exceeds both AudioSeal with Encodec and reinforced TraceableSpeech in watermark extraction accuracy. At a bandwidth of 6 kbps with a watermark capacity of 16 bps, WMCodec maintains over 99% extraction accuracy under common attacks, demonstrating strong robustness.
Citations: 0
DETECLAP: Enhancing Audio-Visual Representation Learning with Object Information
arXiv - EE - Audio and Speech Processing Pub Date : 2024-09-18 DOI: arxiv-2409.11729
Shota Nakada, Taichi Nishimura, Hokuto Munakata, Masayoshi Kondo, Tatsuya Komatsu
Abstract: Current audio-visual representation learning can capture rough object categories (e.g., "animals" and "instruments"), but it lacks the ability to recognize fine-grained details, such as specific categories like "dogs" and "flutes" within animals and instruments. To address this issue, we introduce DETECLAP, a method to enhance audio-visual representation learning with object information. Our key idea is to introduce an audio-visual label prediction loss to the existing Contrastive Audio-Visual Masked AutoEncoder to enhance its object awareness. To avoid costly manual annotations, we prepare object labels from both audio and visual inputs using state-of-the-art language-audio models and object detectors. We evaluate the method on audio-visual retrieval and classification using the VGGSound and AudioSet20K datasets. Our method achieves improvements in recall@10 of +1.5% and +1.2% for audio-to-visual and visual-to-audio retrieval, respectively, and an improvement in accuracy of +0.6% for audio-visual classification.
Citations: 0
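For readers unfamiliar with the recall@10 retrieval metric quoted in the abstract above, here is a minimal sketch of how recall@K is commonly computed. It is not taken from the paper: the function name `recall_at_k`, the convention that the correct candidate for query `i` sits at index `i`, and the toy similarity matrix are all assumptions made for illustration.

```python
def recall_at_k(similarity, k=10):
    """Fraction of queries whose correct candidate ranks in the top k.

    similarity[i][j] is the score between query i and candidate j.
    Assumption for this sketch: the correct candidate for query i is index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank candidate indices by descending similarity score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Hypothetical 3x3 audio-to-visual similarity matrix (made-up numbers).
toy = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.7],
]
print(recall_at_k(toy, k=1))
print(recall_at_k(toy, k=2))
```

With a real model, each row would hold the similarities between one audio embedding and every video embedding (or the reverse for visual-to-audio retrieval).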
Spin Detection Using Racket Bounce Sounds in Table Tennis
arXiv - EE - Audio and Speech Processing Pub Date : 2024-09-18 DOI: arxiv-2409.11760
Thomas Gossard, Julian Schmalzl, Andreas Ziegler, Andreas Zell
Abstract: While table tennis players primarily rely on visual cues, sound provides valuable information. The sound generated when the ball strikes the racket can assist in predicting the ball's trajectory, especially in determining the spin. While professional players can distinguish spin through these auditory cues, they often go unnoticed by untrained players. In this paper, we demonstrate that different rackets produce distinct sounds, which can be used to identify the racket type. In addition, we show that the sound generated by the racket can indicate whether spin was applied to the ball. To achieve this, we created a comprehensive dataset featuring bounce sounds from 10 racket configurations, each applying various spins to the ball. To achieve millisecond-level temporal accuracy, we first detect high-frequency peaks that may correspond to table tennis ball bounces. We then refine these results using a CNN-based classifier that accurately predicts both the type of racket used and whether spin was applied.
Citations: 0
Conformal Prediction for Manifold-based Source Localization with Gaussian Processes
arXiv - EE - Audio and Speech Processing Pub Date : 2024-09-18 DOI: arxiv-2409.11804
Vadim Rozenfeld, Bracha Laufer Goldshtein
Abstract: We tackle the challenge of uncertainty quantification in the localization of a sound source within adverse acoustic environments. Estimating the position of the source is influenced by various factors such as noise and reverberation, leading to significant uncertainty. Quantifying this uncertainty is essential, particularly when localization outcomes impact critical decision-making processes, such as in robot audition, where the accuracy of location estimates directly influences subsequent actions. Despite this, many localization methods typically offer point estimates without quantifying the estimation uncertainty. To address this, we employ conformal prediction (CP), a framework that delivers statistically valid prediction intervals with finite-sample guarantees, independent of the data distribution. However, commonly used Inductive CP (ICP) methods require a substantial amount of labeled data, which can be difficult to obtain in the localization setting. To mitigate this limitation, we incorporate a manifold-based localization method using Gaussian process regression (GPR), with an efficient Transductive CP (TCP) technique specifically designed for GPR. We demonstrate that our method generates statistically valid uncertainty intervals across different acoustic conditions.
Citations: 0
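As background for the abstract above, the following is a minimal sketch of generic inductive (split) conformal prediction for a scalar regression target, the labeled-data-hungry baseline the abstract contrasts against. It is not the paper's transductive GPR method. The function name `split_conformal_interval`, the absolute-residual score, and the toy numbers are assumptions for illustration only.

```python
import math


def split_conformal_interval(cal_preds, cal_targets, test_pred, alpha=0.1):
    """Prediction interval with target coverage 1 - alpha (split conformal).

    cal_preds / cal_targets: point predictions and true values on a held-out
    calibration set. test_pred: point prediction for a new input.
    Assumption: calibration and test points are exchangeable.
    """
    # Nonconformity score: absolute residual on each calibration point.
    scores = sorted(abs(y - p) for p, y in zip(cal_preds, cal_targets))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: only the trivial
        # (unbounded) interval carries the coverage guarantee.
        return (float("-inf"), float("inf"))
    q = scores[rank - 1]
    return (test_pred - q, test_pred + q)


# Hypothetical calibration data: predicted vs. true source angle in degrees.
cal_preds = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0]
cal_targets = [11.0, 18.0, 33.0, 36.0, 55.0, 54.0, 77.0, 72.0, 99.0]
print(split_conformal_interval(cal_preds, cal_targets, test_pred=45.0, alpha=0.5))
```

The unbounded-interval branch illustrates why the inductive variant needs a reasonably large calibration set: with few labeled points, small values of `alpha` cannot be served with a finite interval.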
DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech
arXiv - EE - Audio and Speech Processing Pub Date : 2024-09-18 DOI: arxiv-2409.11835
Xin Qi, Ruibo Fu, Zhengqi Wen, Tao Wang, Chunyu Qiang, Jianhua Tao, Chenxing Li, Yi Lu, Shuchen Shi, Zhiyong Wang, Xiaopeng Wang, Yuankun Xie, Yukun Liu, Xuefei Liu, Guanjun Li
Abstract: In recent years, speech diffusion models have advanced rapidly. Alongside the widely used U-Net architecture, transformer-based models such as the Diffusion Transformer (DiT) have also gained attention. However, current DiT speech models treat Mel spectrograms as general images, which overlooks the specific acoustic properties of speech. To address these limitations, we propose a method called Directional Patch Interaction for Text-to-Speech (DPI-TTS), which builds on DiT and achieves fast training without compromising accuracy. Notably, DPI-TTS employs a low-to-high frequency, frame-by-frame progressive inference approach that aligns more closely with acoustic properties, enhancing the naturalness of the generated speech. Additionally, we introduce a fine-grained style temporal modeling method that further improves speaker style similarity. Experimental results demonstrate that our method increases the training speed by nearly 2 times and significantly outperforms the baseline models.
Citations: 0
Dense-TSNet: Dense Connected Two-Stage Structure for Ultra-Lightweight Speech Enhancement
arXiv - EE - Audio and Speech Processing Pub Date : 2024-09-18 DOI: arxiv-2409.11725
Zizhen Lin, Yuanle Li, Junyu Wang, Ruili Li
Abstract: Speech enhancement aims to improve speech quality and intelligibility in noisy environments. Recent advancements have concentrated on deep neural networks, particularly employing the Two-Stage (TS) architecture to enhance feature extraction. However, the complexity and size of these models remain significant, which limits their applicability in resource-constrained scenarios. Designing models suitable for edge devices presents its own set of challenges. Narrow lightweight models often encounter performance bottlenecks due to uneven loss landscapes. Additionally, advanced operators such as Transformers or Mamba may lack the practical adaptability and efficiency that convolutional neural networks (CNNs) offer in real-world deployments. To address these challenges, we propose Dense-TSNet, an innovative ultra-lightweight speech enhancement network. Our approach employs a novel Dense Two-Stage (Dense-TS) architecture, which, compared to the classic Two-Stage architecture, ensures more robust refinement of the objective function in the later training stages. This leads to improved final performance, addressing the early convergence limitations of the baseline model. We also introduce the Multi-View Gaze Block (MVGB), which enhances feature extraction by incorporating global, channel, and local perspectives through CNNs. Furthermore, we discuss how the choice of loss function impacts perceptual quality. Dense-TSNet demonstrates promising performance with a compact model size of around 14K parameters, making it particularly well-suited for deployment in resource-constrained environments.
Citations: 0
Insights into the Incorporation of Signal Information in Binaural Signal Matching with Wearable Microphone Arrays
arXiv - EE - Audio and Speech Processing Pub Date : 2024-09-18 DOI: arxiv-2409.11731
Ami Berger, Vladimir Tourbabin, Jacob Donley, Zamir Ben-Hur, Boaz Rafaely
Abstract: The increasing popularity of spatial audio in applications such as teleconferencing, entertainment, and virtual reality has led to recent developments in binaural reproduction methods. However, only a few of these methods are well-suited for wearable and mobile arrays, which typically consist of a small number of microphones. One such method is binaural signal matching (BSM), which has been shown to produce high-quality binaural signals for wearable arrays. However, BSM may be suboptimal in cases of high direct-to-reverberant ratio (DRR) as it is based on the diffuse sound field assumption. To overcome this limitation, previous studies incorporated sound-field models other than diffuse. However, this approach was not studied comprehensively. This paper extensively investigates two BSM-based methods designed for high DRR scenarios. The methods incorporate a sound field model composed of direct and reverberant components. The methods are investigated both mathematically and using simulations, and finally validated by a listening test. The results show that the proposed methods can significantly improve the performance of BSM, in particular in the direction of the source, while presenting only a negligible degradation in other directions. Furthermore, when source direction estimation is inaccurate, the performance of these methods degrades to equal that of BSM, presenting a desired robustness quality.
Citations: 0
Mixture of Experts Fusion for Fake Audio Detection Using Frozen wav2vec 2.0
arXiv - EE - Audio and Speech Processing Pub Date : 2024-09-18 DOI: arxiv-2409.11909
Zhiyong Wang, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Xiaopeng Wang, Yuankun Xie, Xin Qi, Shuchen Shi, Yi Lu, Yukun Liu, Chenxing Li, Xuefei Liu, Guanjun Li
Abstract: Speech synthesis technology has posed a serious threat to speaker verification systems. Currently, the most effective fake audio detection methods utilize pretrained models, and integrating features from various layers of the pretrained model further enhances detection performance. However, most of the previously proposed fusion methods require fine-tuning the pretrained models, resulting in excessively long training times and hindering model iteration when facing new speech synthesis technology. To address this issue, this paper proposes a feature fusion method based on the Mixture of Experts, which extracts and integrates features relevant to fake audio detection from layer features, guided by a gating network based on the last layer feature, while freezing the pretrained model. Experiments conducted on the ASVspoof2019 and ASVspoof2021 datasets demonstrate that the proposed method achieves competitive performance compared to those requiring fine-tuning.
Citations: 0
ASR Benchmarking: Need for a More Representative Conversational Dataset
arXiv - EE - Audio and Speech Processing Pub Date : 2024-09-18 DOI: arxiv-2409.12042
Gaurav Maheshwari, Dmitry Ivanov, Théo Johannet, Kevin El Haddad
Abstract: Automatic Speech Recognition (ASR) systems have achieved remarkable performance on widely used benchmarks such as LibriSpeech and Fleurs. However, these benchmarks do not adequately reflect the complexities of real-world conversational environments, where speech is often unstructured and contains disfluencies such as pauses, interruptions, and diverse accents. In this study, we introduce a multilingual conversational dataset, derived from TalkBank, consisting of unstructured phone conversations between adults. Our results show a significant performance drop across various state-of-the-art ASR models when tested in conversational settings. Furthermore, we observe a correlation between Word Error Rate and the presence of speech disfluencies, highlighting the critical need for more realistic, conversational ASR benchmarks.
Citations: 0
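The Word Error Rate mentioned in the abstract above is conventionally defined as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below shows that standard definition only; it is not the authors' evaluation pipeline. The function name `word_error_rate`, the whitespace tokenization, and the example sentences are assumptions, and real benchmarks typically normalize text (casing, punctuation) first.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level Levenshtein distance / number of reference words.

    Assumption for this sketch: both inputs are tokenized on whitespace
    with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance from i reference words to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: the recognizer drops the filled pause "um".
print(word_error_rate("i um want to go home", "i want to go home"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.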
Speaking from Coarse to Fine: Improving Neural Codec Language Model via Multi-Scale Speech Coding and Generation
arXiv - EE - Audio and Speech Processing Pub Date : 2024-09-18 DOI: arxiv-2409.11630
Haohan Guo, Fenglong Xie, Dongchao Yang, Xixin Wu, Helen Meng
Abstract: The neural codec language model (CLM) has demonstrated remarkable performance in text-to-speech (TTS) synthesis. However, troubled by "recency bias", CLM lacks sufficient attention to coarse-grained information at a higher temporal scale, often producing unnatural or even unintelligible speech. This work proposes CoFi-Speech, a coarse-to-fine CLM-TTS approach, employing multi-scale speech coding and generation to address this issue. We train a multi-scale neural codec, CoFi-Codec, to encode speech into a multi-scale discrete representation, comprising multiple token sequences with different time resolutions. Then, we propose CoFi-LM that can generate this representation in two modes: the single-LM-based chain-of-scale generation and the multiple-LM-based stack-of-scale generation. In experiments, CoFi-Speech significantly outperforms single-scale baseline systems on naturalness and speaker similarity in zero-shot TTS. The analysis of multi-scale coding demonstrates the effectiveness of CoFi-Codec in learning multi-scale discrete speech representations while keeping high-quality speech reconstruction. The coarse-to-fine multi-scale generation, especially for the stack-of-scale approach, is also validated as a crucial approach in pursuing a high-quality neural codec language model for TTS.
Citations: 0