{"title":"Data Efficient Acoustic Scene Classification using Teacher-Informed Confusing Class Instruction","authors":"Jin Jie Sean Yeo, Ee-Leng Tan, Jisheng Bai, Santi Peksi, Woon-Seng Gan","doi":"arxiv-2409.11964","DOIUrl":"https://doi.org/arxiv-2409.11964","url":null,"abstract":"In this technical report, we describe the SNTL-NTU team's submission for Task\u00001 Data-Efficient Low-Complexity Acoustic Scene Classification of the detection\u0000and classification of acoustic scenes and events (DCASE) 2024 challenge. Three\u0000systems are introduced to tackle training splits of different sizes. For small\u0000training splits, we explored reducing the complexity of the provided baseline\u0000model by reducing the number of base channels. We introduce data augmentation\u0000in the form of mixup to increase the diversity of training samples. For the\u0000larger training splits, we use FocusNet to provide confusing class information\u0000to an ensemble of multiple Patchout faSt Spectrogram Transformer (PaSST) models\u0000and baseline models trained on the original sampling rate of 44.1 kHz. We use\u0000Knowledge Distillation to distill the ensemble model to the baseline student\u0000model. Training the systems on the TAU Urban Acoustic Scene 2022 Mobile\u0000development dataset yielded the highest average testing accuracy of (62.21,\u000059.82, 56.81, 53.03, 47.97)% on split (100, 50, 25, 10, 5)% respectively over\u0000the three systems.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference","authors":"Edresson Casanova, Ryan Langman, Paarth Neekhara, Shehzeen Hussain, Jason Li, Subhankar Ghosh, Ante Jukić, Sang-gil Lee","doi":"arxiv-2409.12117","DOIUrl":"https://doi.org/arxiv-2409.12117","url":null,"abstract":"Large language models (LLMs) have significantly advanced audio processing\u0000through audio codecs that convert audio into discrete tokens, enabling the\u0000application of language modeling techniques to audio data. However, audio\u0000codecs often operate at high frame rates, resulting in slow training and\u0000inference, especially for autoregressive models. To address this challenge, we\u0000present the Low Frame-rate Speech Codec (LFSC): a neural audio codec that\u0000leverages finite scalar quantization and adversarial training with large speech\u0000language models to achieve high-quality audio compression with a 1.89 kbps\u0000bitrate and 21.5 frames per second. We demonstrate that our novel codec can\u0000make the inference of LLM-based text-to-speech models around three times faster\u0000while improving intelligibility and producing quality comparable to previous\u0000models.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265471","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models","authors":"EverestAI, :, Sijin Chen, Yuan Feng, Laipeng He, Tianwei He, Wendi He, Yanni Hu, Bin Lin, Yiting Lin, Pengfei Tan, Chengwei Tian, Chen Wang, Zhicheng Wang, Ruoye Xie, Jingjing Yin, Jianhao Ye, Jixun Yao, Quanlei Yan, Yuguang Yang","doi":"arxiv-2409.12139","DOIUrl":"https://doi.org/arxiv-2409.12139","url":null,"abstract":"With the advent of the big data and large language model era, zero-shot\u0000personalized rapid customization has emerged as a significant trend. In this\u0000report, we introduce Takin AudioLLM, a series of techniques and models, mainly\u0000including Takin TTS, Takin VC, and Takin Morphing, specifically designed for\u0000audiobook production. These models are capable of zero-shot speech production,\u0000generating high-quality speech that is nearly indistinguishable from real human\u0000speech and facilitating individuals to customize the speech content according\u0000to their own needs. Specifically, we first introduce Takin TTS, a neural codec\u0000language model that builds upon an enhanced neural speech codec and a\u0000multi-task training framework, capable of generating high-fidelity natural\u0000speech in a zero-shot way. For Takin VC, we advocate an effective content and\u0000timbre joint modeling approach to improve the speaker similarity, while\u0000advocating for a conditional flow matching based decoder to further enhance its\u0000naturalness and expressiveness. Last, we propose the Takin Morphing system with\u0000highly decoupled and advanced timbre and prosody modeling approaches, which\u0000enables individuals to customize speech production with their preferred timbre\u0000and prosody in a precise and controllable manner. Extensive experiments\u0000validate the effectiveness and robustness of our Takin AudioLLM series models.\u0000For detailed demos, please refer to https://takinaudiollm.github.io.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pareto Data Framework: Steps Towards Resource-Efficient Decision Making Using Minimum Viable Data (MVD)","authors":"Tashfain Ahmed, Josh Siegel","doi":"arxiv-2409.12112","DOIUrl":"https://doi.org/arxiv-2409.12112","url":null,"abstract":"This paper introduces the Pareto Data Framework, an approach for identifying\u0000and selecting the Minimum Viable Data (MVD) required for enabling machine\u0000learning applications on constrained platforms such as embedded systems, mobile\u0000devices, and Internet of Things (IoT) devices. We demonstrate that strategic\u0000data reduction can maintain high performance while significantly reducing\u0000bandwidth, energy, computation, and storage costs. The framework identifies\u0000Minimum Viable Data (MVD) to optimize efficiency across resource-constrained\u0000environments without sacrificing performance. It addresses common inefficient\u0000practices in an IoT application such as overprovisioning of sensors and\u0000overprecision, and oversampling of signals, proposing scalable solutions for\u0000optimal sensor selection, signal extraction and transmission, and data\u0000representation. An experimental methodology demonstrates effective acoustic\u0000data characterization after downsampling, quantization, and truncation to\u0000simulate reduced-fidelity sensors and network and storage constraints; results\u0000shows that performance can be maintained up to 95% with sample rates reduced\u0000by 75% and bit depths and clip length reduced by 50% which translates into\u0000substantial cost and resource reduction. These findings have implications on\u0000the design and development of constrained systems. The paper also discusses\u0000broader implications of the framework, including the potential to democratize\u0000advanced AI technologies across IoT applications and sectors such as\u0000agriculture, transportation, and manufacturing to improve access and multiply\u0000the benefits of data-driven insights.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265473","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"M2R-Whisper: Multi-stage and Multi-scale Retrieval Augmentation for Enhancing Whisper","authors":"Jiaming Zhou, Shiwan Zhao, Jiabei He, Hui Wang, Wenjia Zeng, Yong Chen, Haoqin Sun, Aobo Kong, Yong Qin","doi":"arxiv-2409.11889","DOIUrl":"https://doi.org/arxiv-2409.11889","url":null,"abstract":"State-of-the-art models like OpenAI's Whisper exhibit strong performance in\u0000multilingual automatic speech recognition (ASR), but they still face challenges\u0000in accurately recognizing diverse subdialects. In this paper, we propose\u0000M2R-whisper, a novel multi-stage and multi-scale retrieval augmentation\u0000approach designed to enhance ASR performance in low-resource settings. Building\u0000on the principles of in-context learning (ICL) and retrieval-augmented\u0000techniques, our method employs sentence-level ICL in the pre-processing stage\u0000to harness contextual information, while integrating token-level k-Nearest\u0000Neighbors (kNN) retrieval as a post-processing step to further refine the final\u0000output distribution. By synergistically combining sentence-level and\u0000token-level retrieval strategies, M2R-whisper effectively mitigates various\u0000types of recognition errors. Experiments conducted on Mandarin and subdialect\u0000datasets, including AISHELL-1 and KeSpeech, demonstrate substantial\u0000improvements in ASR accuracy, all achieved without any parameter updates.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"METEOR: Melody-aware Texture-controllable Symbolic Orchestral Music Generation","authors":"Dinh-Viet-Toan Le, Yi-Hsuan Yang","doi":"arxiv-2409.11753","DOIUrl":"https://doi.org/arxiv-2409.11753","url":null,"abstract":"Western music is often characterized by a homophonic texture, in which the\u0000musical content can be organized into a melody and an accompaniment. In\u0000orchestral music, in particular, the composer can select specific\u0000characteristics for each instrument's part within the accompaniment, while also\u0000needing to adapt the melody to suit the capabilities of the instruments\u0000performing it. In this work, we propose METEOR, a model for Melody-aware\u0000Texture-controllable Orchestral music generation. This model performs symbolic\u0000multi-track music style transfer with a focus on melodic fidelity. We allow\u0000bar- and track-level controllability of the accompaniment with various textural\u0000attributes while keeping a homophonic texture. We show that the model can\u0000achieve controllability performances similar to strong baselines while greatly\u0000improve melodic fidelity.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Simulating Native Speaker Shadowing for Nonnative Speech Assessment with Latent Speech Representations","authors":"Haopeng Geng, Daisuke Saito, Minematsu Nobuaki","doi":"arxiv-2409.11742","DOIUrl":"https://doi.org/arxiv-2409.11742","url":null,"abstract":"Evaluating speech intelligibility is a critical task in computer-aided\u0000language learning systems. Traditional methods often rely on word error rates\u0000(WER) provided by automatic speech recognition (ASR) as intelligibility scores.\u0000However, this approach has significant limitations due to notable differences\u0000between human speech recognition (HSR) and ASR. A promising alternative is to\u0000involve a native (L1) speaker in shadowing what nonnative (L2) speakers say.\u0000Breakdowns or mispronunciations in the L1 speaker's shadowing utterance can\u0000serve as indicators for assessing L2 speech intelligibility. In this study, we\u0000propose a speech generation system that simulates the L1 shadowing process\u0000using voice conversion (VC) techniques and latent speech representations. Our\u0000experimental results demonstrate that this method effectively replicates the L1\u0000shadowing process, offering an innovative tool to evaluate L2 speech\u0000intelligibility. Notably, systems that utilize self-supervised speech\u0000representations (S3R) show a higher degree of similarity to real L1 shadowing\u0000utterances in both linguistic accuracy and naturalness.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265500","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SALT: Standardized Audio event Label Taxonomy","authors":"Paraskevas StamatiadisIDS, S2A, LTCI, Michel OlveraIDS, S2A, LTCI, Slim EssidIDS, S2A, LTCI","doi":"arxiv-2409.11746","DOIUrl":"https://doi.org/arxiv-2409.11746","url":null,"abstract":"Machine listening systems often rely on fixed taxonomies to organize and\u0000label audio data, key for training and evaluating deep neural networks (DNNs)\u0000and other supervised algorithms. However, such taxonomies face significant\u0000constraints: they are composed of application-dependent predefined categories,\u0000which hinders the integration of new or varied sounds, and exhibits limited\u0000cross-dataset compatibility due to inconsistent labeling standards. To overcome\u0000these limitations, we introduce SALT: Standardized Audio event Label Taxonomy.\u0000Building upon the hierarchical structure of AudioSet's ontology, our taxonomy\u0000extends and standardizes labels across 24 publicly available environmental\u0000sound datasets, allowing the mapping of class labels from diverse datasets to a\u0000unified system. Our proposal comes with a new Python package designed for\u0000navigating and utilizing this taxonomy, easing cross-dataset label searching\u0000and hierarchical exploration. Notably, our package allows effortless data\u0000aggregation from diverse sources, hence easy experimentation with combined\u0000datasets.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring an Inter-Pausal Unit (IPU) based Approach for Indic End-to-End TTS Systems","authors":"Anusha Prakash, Hema A Murthy","doi":"arxiv-2409.11915","DOIUrl":"https://doi.org/arxiv-2409.11915","url":null,"abstract":"Sentences in Indian languages are generally longer than those in English.\u0000Indian languages are also considered to be phrase-based, wherein semantically\u0000complete phrases are concatenated to make up sentences. Long utterances lead to\u0000poor training of text-to-speech models and result in poor prosody during\u0000synthesis. In this work, we explore an inter-pausal unit (IPU) based approach\u0000in the end-to-end (E2E) framework, focusing on synthesising\u0000conversational-style text. We consider both autoregressive Tacotron2 and\u0000non-autoregressive FastSpeech2 architectures in our study and perform\u0000experiments with three Indian languages, namely, Hindi, Tamil and Telugu. With\u0000the IPU-based Tacotron2 approach, we see a reduction in insertion and deletion\u0000errors in the synthesised audio, providing an alternative approach to the\u0000FastSpeech(2) network in terms of error reduction. The IPU-based approach\u0000requires less computational resources and produces prosodically richer\u0000synthesis compared to conventional sentence-based systems.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265466","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive Large Language Models By Layerwise Attention Shortcuts","authors":"Prateek Verma, Mert Pilanci","doi":"arxiv-2409.10870","DOIUrl":"https://doi.org/arxiv-2409.10870","url":null,"abstract":"Transformer architectures are the backbone of the modern AI revolution.\u0000However, they are based on simply stacking the same blocks in dozens of layers\u0000and processing information sequentially from one block to another. In this\u0000paper, we propose to challenge this and introduce adaptive computations for\u0000LLM-like setups, which allow the final layer to attend to all of the\u0000intermediate layers as it deems fit through the attention mechanism, thereby\u0000introducing computational textbf{attention shortcuts}. These shortcuts can\u0000thus make the architecture depth and context adaptive. We showcase four\u0000different datasets, namely acoustic tokens, natural language, and symbolic\u0000music, and we achieve superior performance for GPT-like architecture. We give\u0000evidence via attention maps that the models learn complex dependencies across\u0000layers that are adaptive in context and depth depending on the input tokens.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}