arXiv - EE - Audio and Speech Processing: Latest Publications

Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT
arXiv - EE - Audio and Speech Processing Pub Date : 2024-09-11 DOI: arxiv-2409.07265
Kazuki Yamauchi, Yuki Saito, Hiroshi Saruwatari
{"title":"Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT","authors":"Kazuki Yamauchi, Yuki Saito, Hiroshi Saruwatari","doi":"arxiv-2409.07265","DOIUrl":"https://doi.org/arxiv-2409.07265","url":null,"abstract":"We explore cross-dialect text-to-speech (CD-TTS), a task to synthesize\u0000learned speakers' voices in non-native dialects, especially in pitch-accent\u0000languages. CD-TTS is important for developing voice agents that naturally\u0000communicate with people across regions. We present a novel TTS model comprising\u0000three sub-modules to perform competitively at this task. We first train a\u0000backbone TTS model to synthesize dialect speech from a text conditioned on\u0000phoneme-level accent latent variables (ALVs) extracted from speech by a\u0000reference encoder. Then, we train an ALV predictor to predict ALVs tailored to\u0000a target dialect from input text leveraging our novel multi-dialect\u0000phoneme-level BERT. We conduct multi-dialect TTS experiments and evaluate the\u0000effectiveness of our model by comparing it with a baseline derived from\u0000conventional dialect TTS methods. The results show that our model improves the\u0000dialectal naturalness of synthetic speech in CD-TTS.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217850","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
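As a rough illustration of the conditioning step described in the abstract, the sketch below shows how per-phoneme accent latent variables (ALVs) could be injected into a TTS text encoder. All names, dimensions, and the additive fusion rule are assumptions made for illustration; this is not the paper's implementation.

```python
# Hypothetical sketch of phoneme-level ALV conditioning (not the paper's code).
import torch
import torch.nn as nn

class ALVConditioner(nn.Module):
    """Adds per-phoneme accent latent variables (ALVs) to phoneme embeddings."""
    def __init__(self, n_phonemes=100, d_model=256, d_alv=8):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        self.alv_proj = nn.Linear(d_alv, d_model)  # project ALVs into the model dimension

    def forward(self, phoneme_ids, alvs):
        # phoneme_ids: (batch, T) integer phoneme ids
        # alvs: (batch, T, d_alv) per-phoneme accent latents, taken from a reference
        #       encoder during training or from an ALV predictor at inference time
        return self.phoneme_emb(phoneme_ids) + self.alv_proj(alvs)

cond = ALVConditioner()
out = cond(torch.randint(0, 100, (2, 12)), torch.randn(2, 12, 8))
print(out.shape)  # torch.Size([2, 12, 256])
```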
FlowSep: Language-Queried Sound Separation with Rectified Flow Matching
arXiv - EE - Audio and Speech Processing Pub Date : 2024-09-11 DOI: arxiv-2409.07614
Yi Yuan, Xubo Liu, Haohe Liu, Mark D. Plumbley, Wenwu Wang
{"title":"FlowSep: Language-Queried Sound Separation with Rectified Flow Matching","authors":"Yi Yuan, Xubo Liu, Haohe Liu, Mark D. Plumbley, Wenwu Wang","doi":"arxiv-2409.07614","DOIUrl":"https://doi.org/arxiv-2409.07614","url":null,"abstract":"Language-queried audio source separation (LASS) focuses on separating sounds\u0000using textual descriptions of the desired sources. Current methods mainly use\u0000discriminative approaches, such as time-frequency masking, to separate target\u0000sounds and minimize interference from other sources. However, these models face\u0000challenges when separating overlapping soundtracks, which may lead to artifacts\u0000such as spectral holes or incomplete separation. Rectified flow matching (RFM),\u0000a generative model that establishes linear relations between the distribution\u0000of data and noise, offers superior theoretical properties and simplicity, but\u0000has not yet been explored in sound separation. In this work, we introduce\u0000FlowSep, a new generative model based on RFM for LASS tasks. FlowSep learns\u0000linear flow trajectories from noise to target source features within the\u0000variational autoencoder (VAE) latent space. During inference, the RFM-generated\u0000latent features are reconstructed into a mel-spectrogram via the pre-trained\u0000VAE decoder, followed by a pre-trained vocoder to synthesize the waveform.\u0000Trained on 1,680 hours of audio data, FlowSep outperforms the state-of-the-art\u0000models across multiple benchmarks, as evaluated with subjective and objective\u0000metrics. Additionally, our results show that FlowSep surpasses a\u0000diffusion-based LASS model in both separation quality and inference efficiency,\u0000highlighting its strong potential for audio source separation tasks. Code,\u0000pre-trained models and demos can be found at:\u0000https://audio-agi.github.io/FlowSep_demo/.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"460 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142227172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
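The core training objective of rectified flow matching is simple enough to sketch: interpolate linearly between a noise sample and the target latent and regress the constant velocity of that straight path. The snippet below is a generic RFM training step under toy dimensions, not FlowSep's actual code or conditioning scheme.

```python
# Generic rectified-flow-matching training step (illustrative, not FlowSep's code).
import torch
import torch.nn as nn

class TinyVelocityNet(nn.Module):
    """Toy stand-in for the network that predicts the flow velocity."""
    def __init__(self, d_z=16, d_txt=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_z + 1 + d_txt, 64), nn.SiLU(), nn.Linear(64, d_z))

    def forward(self, z_t, t, text_emb):
        return self.net(torch.cat([z_t, t, text_emb], dim=-1))

def rfm_training_loss(model, z_target, text_emb):
    # z_target: (B, d_z) latents of the target source; text_emb: (B, d_txt) query embedding
    noise = torch.randn_like(z_target)            # x_0 ~ N(0, I)
    t = torch.rand(z_target.size(0), 1)           # t ~ U(0, 1)
    z_t = (1 - t) * noise + t * z_target          # point on the straight noise-to-data path
    v_target = z_target - noise                   # constant velocity of that path
    return nn.functional.mse_loss(model(z_t, t, text_emb), v_target)

model = TinyVelocityNet()
print(rfm_training_loss(model, torch.randn(4, 16), torch.randn(4, 8)).item())
# At inference, one would integrate dz/dt = model(z, t, text_emb) from t=0 (noise)
# to t=1 with a few Euler steps, then decode z with the VAE decoder and a vocoder.
```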
Analytic Class Incremental Learning for Sound Source Localization with Privacy Protection
arXiv - EE - Audio and Speech Processing Pub Date : 2024-09-11 DOI: arxiv-2409.07224
Xinyuan Qian, Xianghu Yue, Jiadong Wang, Huiping Zhuang, Haizhou Li
{"title":"Analytic Class Incremental Learning for Sound Source Localization with Privacy Protection","authors":"Xinyuan Qian, Xianghu Yue, Jiadong Wang, Huiping Zhuang, Haizhou Li","doi":"arxiv-2409.07224","DOIUrl":"https://doi.org/arxiv-2409.07224","url":null,"abstract":"Sound Source Localization (SSL) enabling technology for applications such as\u0000surveillance and robotics. While traditional Signal Processing (SP)-based SSL\u0000methods provide analytic solutions under specific signal and noise assumptions,\u0000recent Deep Learning (DL)-based methods have significantly outperformed them.\u0000However, their success depends on extensive training data and substantial\u0000computational resources. Moreover, they often rely on large-scale annotated\u0000spatial data and may struggle when adapting to evolving sound classes. To\u0000mitigate these challenges, we propose a novel Class Incremental Learning (CIL)\u0000approach, termed SSL-CIL, which avoids serious accuracy degradation due to\u0000catastrophic forgetting by incrementally updating the DL-based SSL model\u0000through a closed-form analytic solution. In particular, data privacy is ensured\u0000since the learning process does not revisit any historical data\u0000(exemplar-free), which is more suitable for smart home scenarios. Empirical\u0000results in the public SSLR dataset demonstrate the superior performance of our\u0000proposal, achieving a localization accuracy of 90.9%, surpassing other\u0000competitive methods.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"43 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217882","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
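To make the "closed-form, exemplar-free" idea concrete, the sketch below shows a generic analytic (ridge-regression) classifier head updated incrementally from accumulated statistics only, so no raw historical data is ever revisited. This is in the spirit of analytic class-incremental learning, not the paper's SSL-CIL model; the regularizer, target encoding, and frozen-backbone assumption are illustrative choices.

```python
# Generic exemplar-free analytic classifier update (illustration, not SSL-CIL itself).
import numpy as np

class AnalyticIncrementalHead:
    def __init__(self, feat_dim, reg=1e-1):
        self.A = reg * np.eye(feat_dim)   # accumulated X^T X + lambda * I
        self.B = None                     # accumulated X^T Y

    def update(self, feats, targets):
        # feats:   (N, feat_dim) embeddings from a frozen backbone for the new phase
        # targets: (N, n_outputs) one-hot targets; n_outputs grows as classes arrive
        self.A += feats.T @ feats
        xty = feats.T @ targets
        if self.B is None:
            self.B = xty
        else:
            if self.B.shape[1] < targets.shape[1]:       # new classes add output columns
                pad = np.zeros((self.B.shape[0], targets.shape[1] - self.B.shape[1]))
                self.B = np.hstack([self.B, pad])
            self.B += xty
        return np.linalg.solve(self.A, self.B)           # closed-form weights after this phase

head = AnalyticIncrementalHead(feat_dim=32)
W1 = head.update(np.random.randn(100, 32), np.eye(4)[np.random.randint(0, 4, 100)])
W2 = head.update(np.random.randn(80, 32), np.eye(6)[np.random.randint(4, 6, 80)])
print(W1.shape, W2.shape)  # (32, 4) (32, 6)
```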
Neural Ambisonic Encoding For Multi-Speaker Scenarios Using A Circular Microphone Array
arXiv - EE - Audio and Speech Processing Pub Date : 2024-09-11 DOI: arxiv-2409.06954
Yue Qiao, Vinay Kothapally, Meng Yu, Dong Yu
{"title":"Neural Ambisonic Encoding For Multi-Speaker Scenarios Using A Circular Microphone Array","authors":"Yue Qiao, Vinay Kothapally, Meng Yu, Dong Yu","doi":"arxiv-2409.06954","DOIUrl":"https://doi.org/arxiv-2409.06954","url":null,"abstract":"Spatial audio formats like Ambisonics are playback device layout-agnostic and\u0000well-suited for applications such as teleconferencing and virtual reality.\u0000Conventional Ambisonic encoding methods often rely on spherical microphone\u0000arrays for efficient sound field capture, which limits their flexibility in\u0000practical scenarios. We propose a deep learning (DL)-based approach, leveraging\u0000a two-stage network architecture for encoding circular microphone array signals\u0000into second-order Ambisonics (SOA) in multi-speaker environments. In addition,\u0000we introduce: (i) a novel loss function based on spatial power maps to\u0000regularize inter-channel correlations of the Ambisonic signals, and (ii) a\u0000channel permutation technique to resolve the ambiguity of encoding vertical\u0000information using a horizontal circular array. Evaluation on simulated speech\u0000and noise datasets shows that our approach consistently outperforms traditional\u0000signal processing (SP) and DL-based methods, providing significantly better\u0000timbral and spatial quality and higher source localization accuracy. Binaural\u0000audio demos with visualizations are available at\u0000https://bridgoon97.github.io/NeuralAmbisonicEncoding/.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"75 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
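For readers unfamiliar with the target representation, the snippet below computes the ideal (signal-processing) second-order Ambisonics channel gains for a single plane-wave source, which is the kind of encoding a neural front end would learn to approximate from circular-array signals. The ambiX convention (ACN ordering, SN3D normalization) is an assumption here; the paper's exact convention is not specified in the abstract.

```python
# Reference SOA plane-wave gains in ambiX (ACN/SN3D) convention -- illustrative only.
import numpy as np

def soa_gains(azimuth, elevation):
    """Return the 9 second-order Ambisonics gains for a source at (azimuth, elevation) in radians."""
    ca, sa = np.cos(azimuth), np.sin(azimuth)
    ce, se = np.cos(elevation), np.sin(elevation)
    return np.array([
        1.0,                                              # W (ACN 0)
        sa * ce,                                          # Y (ACN 1)
        se,                                               # Z (ACN 2)
        ca * ce,                                          # X (ACN 3)
        np.sqrt(3) / 2 * np.sin(2 * azimuth) * ce ** 2,   # V (ACN 4)
        np.sqrt(3) / 2 * sa * np.sin(2 * elevation),      # T (ACN 5)
        0.5 * (3 * se ** 2 - 1),                          # R (ACN 6)
        np.sqrt(3) / 2 * ca * np.sin(2 * elevation),      # S (ACN 7)
        np.sqrt(3) / 2 * np.cos(2 * azimuth) * ce ** 2,   # U (ACN 8)
    ])

# gains for a talker 30 degrees to the left at ear level
print(soa_gains(np.deg2rad(30), 0.0).round(3))
```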
SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis
arXiv - EE - Audio and Speech Processing Pub Date : 2024-09-11 DOI: arxiv-2409.07556
Helin Wang, Meng Yu, Jiarui Hai, Chen Chen, Yuchen Hu, Rilin Chen, Najim Dehak, Dong Yu
{"title":"SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis","authors":"Helin Wang, Meng Yu, Jiarui Hai, Chen Chen, Yuchen Hu, Rilin Chen, Najim Dehak, Dong Yu","doi":"arxiv-2409.07556","DOIUrl":"https://doi.org/arxiv-2409.07556","url":null,"abstract":"In this paper, we introduce SSR-Speech, a neural codec autoregressive model\u0000designed for stable, safe, and robust zero-shot text-based speech editing and\u0000text-to-speech synthesis. SSR-Speech is built on a Transformer decoder and\u0000incorporates classifier-free guidance to enhance the stability of the\u0000generation process. A watermark Encodec is proposed to embed frame-level\u0000watermarks into the edited regions of the speech so that which parts were\u0000edited can be detected. In addition, the waveform reconstruction leverages the\u0000original unedited speech segments, providing superior recovery compared to the\u0000Encodec model. Our approach achieves the state-of-the-art performance in the\u0000RealEdit speech editing task and the LibriTTS text-to-speech task, surpassing\u0000previous methods. Furthermore, SSR-Speech excels in multi-span speech editing\u0000and also demonstrates remarkable robustness to background sounds. Source code\u0000and demos are released.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218032","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
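Classifier-free guidance for a token-based autoregressive model can be summarized in one line: run the model with and without the conditioning signal and push the logits toward the conditioned prediction. The sketch below is that generic formulation under a toy model; SSR-Speech's actual conditioning inputs and guidance schedule are not reproduced here.

```python
# Generic classifier-free guidance at decoding time (not SSR-Speech's implementation).
import torch

def cfg_next_token_logits(model, prefix_tokens, condition, null_condition, guidance_scale=1.5):
    """Blend conditional and unconditional predictions, emphasizing the condition."""
    logits_cond = model(prefix_tokens, condition)          # conditioned pass
    logits_uncond = model(prefix_tokens, null_condition)   # condition dropped
    return logits_uncond + guidance_scale * (logits_cond - logits_uncond)

# toy "model": next-token logits depend only on the pooled condition vector
toy_model = lambda toks, cond: cond.sum(dim=-1, keepdim=True).expand(-1, 32)
logits = cfg_next_token_logits(toy_model, None, torch.ones(1, 8), torch.zeros(1, 8))
print(logits.shape)  # torch.Size([1, 32])
```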
VoiceWukong: Benchmarking Deepfake Voice Detection
arXiv - EE - Audio and Speech Processing Pub Date : 2024-09-10 DOI: arxiv-2409.06348
Ziwei Yan, Yanjie Zhao, Haoyu Wang
{"title":"VoiceWukong: Benchmarking Deepfake Voice Detection","authors":"Ziwei Yan, Yanjie Zhao, Haoyu Wang","doi":"arxiv-2409.06348","DOIUrl":"https://doi.org/arxiv-2409.06348","url":null,"abstract":"With the rapid advancement of technologies like text-to-speech (TTS) and\u0000voice conversion (VC), detecting deepfake voices has become increasingly\u0000crucial. However, both academia and industry lack a comprehensive and intuitive\u0000benchmark for evaluating detectors. Existing datasets are limited in language\u0000diversity and lack many manipulations encountered in real-world production\u0000environments. To fill this gap, we propose VoiceWukong, a benchmark designed to evaluate\u0000the performance of deepfake voice detectors. To build the dataset, we first\u0000collected deepfake voices generated by 19 advanced and widely recognized\u0000commercial tools and 15 open-source tools. We then created 38 data variants\u0000covering six types of manipulations, constructing the evaluation dataset for\u0000deepfake voice detection. VoiceWukong thus includes 265,200 English and 148,200\u0000Chinese deepfake voice samples. Using VoiceWukong, we evaluated 12\u0000state-of-the-art detectors. AASIST2 achieved the best equal error rate (EER) of\u000013.50%, while all others exceeded 20%. Our findings reveal that these detectors\u0000face significant challenges in real-world applications, with dramatically\u0000declining performance. In addition, we conducted a user study with more than\u0000300 participants. The results are compared with the performance of the 12\u0000detectors and a multimodel large language model (MLLM), i.e., Qwen2-Audio,\u0000where different detectors and humans exhibit varying identification\u0000capabilities for deepfake voices at different deception levels, while the LALM\u0000demonstrates no detection ability at all. Furthermore, we provide a leaderboard\u0000for deepfake voice detection, publicly available at\u0000{https://voicewukong.github.io}.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"39 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
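The headline metric here, equal error rate (EER), is the operating point where the false-acceptance and false-rejection rates coincide. Below is a generic EER computation from detector scores, independent of the benchmark's own evaluation code; the synthetic scores are only for demonstration.

```python
# Generic equal error rate (EER) computation from detection scores.
import numpy as np

def compute_eer(scores, labels):
    # scores: higher = more likely bona fide; labels: 1 = bona fide, 0 = deepfake
    order = np.argsort(scores)
    labels = np.asarray(labels, dtype=float)[order]
    n_pos, n_neg = labels.sum(), (1 - labels).sum()
    # sweep the decision threshold from the lowest score upward
    frr = np.cumsum(labels) / n_pos              # bona fide trials rejected so far
    far = 1 - np.cumsum(1 - labels) / n_neg      # deepfake trials still accepted
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2

scores = np.concatenate([np.random.normal(1, 1, 1000), np.random.normal(-1, 1, 1000)])
labels = np.concatenate([np.ones(1000), np.zeros(1000)])
print(f"EER = {compute_eer(scores, labels):.2%}")
```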
InstructSing: High-Fidelity Singing Voice Generation via Instructing Yourself
arXiv - EE - Audio and Speech Processing Pub Date : 2024-09-10 DOI: arxiv-2409.06330
Chang Zeng, Chunhui Wang, Xiaoxiao Miao, Jian Zhao, Zhonglin Jiang, Yong Chen
{"title":"InstructSing: High-Fidelity Singing Voice Generation via Instructing Yourself","authors":"Chang Zeng, Chunhui Wang, Xiaoxiao Miao, Jian Zhao, Zhonglin Jiang, Yong Chen","doi":"arxiv-2409.06330","DOIUrl":"https://doi.org/arxiv-2409.06330","url":null,"abstract":"It is challenging to accelerate the training process while ensuring both\u0000high-quality generated voices and acceptable inference speed. In this paper, we\u0000propose a novel neural vocoder called InstructSing, which can converge much\u0000faster compared with other neural vocoders while maintaining good performance\u0000by integrating differentiable digital signal processing and adversarial\u0000training. It includes one generator and two discriminators. Specifically, the\u0000generator incorporates a harmonic-plus-noise (HN) module to produce 8kHz audio\u0000as an instructive signal. Subsequently, the HN module is connected with an\u0000extended WaveNet by an UNet-based module, which transforms the output of the HN\u0000module to a latent variable sequence containing essential periodic and\u0000aperiodic information. In addition to the latent sequence, the extended WaveNet\u0000also takes the mel-spectrogram as input to generate 48kHz high-fidelity singing\u0000voices. In terms of discriminators, we combine a multi-period discriminator, as\u0000originally proposed in HiFiGAN, with a multi-resolution multi-band STFT\u0000discriminator. Notably, InstructSing achieves comparable voice quality to other\u0000neural vocoders but with only one-tenth of the training steps on a 4 NVIDIA\u0000V100 GPU machinefootnote{{Demo page:\u0000href{https://wavelandspeech.github.io/instructsing/}{texttt{https://wavelandspeech.github.io/instructsing/}}}}.\u0000We plan to open-source our code and pretrained model once the paper get\u0000accepted.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"12 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
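The "instructive signal" comes from a harmonic-plus-noise source, a standard DDSP idea: a sum of sinusoids at multiples of F0 plus a small noise floor. The textbook-style sketch below illustrates that idea only; the paper's HN module, harmonic amplitudes, and noise filtering are not reproduced.

```python
# Toy harmonic-plus-noise excitation driven by F0 (illustration, not the paper's HN module).
import numpy as np

def harmonic_plus_noise(f0, sr=8000, n_harmonics=8, noise_gain=0.03):
    # f0: (T,) per-sample fundamental frequency in Hz (0 for unvoiced samples)
    phase = 2 * np.pi * np.cumsum(f0 / sr)        # running phase of the fundamental
    voiced = (f0 > 0).astype(float)
    harmonics = sum(np.sin(k * phase) for k in range(1, n_harmonics + 1))
    return voiced * harmonics / n_harmonics + noise_gain * np.random.randn(len(f0))

f0 = np.full(8000, 220.0)                         # one second of a steady 220 Hz pitch
excitation = harmonic_plus_noise(f0)
print(excitation.shape, float(excitation.max()))
```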
Spoofing-Aware Speaker Verification Robust Against Domain and Channel Mismatches
arXiv - EE - Audio and Speech Processing Pub Date : 2024-09-10 DOI: arxiv-2409.06327
Chang Zeng, Xiaoxiao Miao, Xin Wang, Erica Cooper, Junichi Yamagishi
{"title":"Spoofing-Aware Speaker Verification Robust Against Domain and Channel Mismatches","authors":"Chang Zeng, Xiaoxiao Miao, Xin Wang, Erica Cooper, Junichi Yamagishi","doi":"arxiv-2409.06327","DOIUrl":"https://doi.org/arxiv-2409.06327","url":null,"abstract":"In real-world applications, it is challenging to build a speaker verification\u0000system that is simultaneously robust against common threats, including spoofing\u0000attacks, channel mismatch, and domain mismatch. Traditional automatic speaker\u0000verification (ASV) systems often tackle these issues separately, leading to\u0000suboptimal performance when faced with simultaneous challenges. In this paper,\u0000we propose an integrated framework that incorporates pair-wise learning and\u0000spoofing attack simulation into the meta-learning paradigm to enhance\u0000robustness against these multifaceted threats. This novel approach employs an\u0000asymmetric dual-path model and a multi-task learning strategy to handle ASV,\u0000anti-spoofing, and spoofing-aware ASV tasks concurrently. A new testing\u0000dataset, CNComplex, is introduced to evaluate system performance under these\u0000combined threats. Experimental results demonstrate that our integrated model\u0000significantly improves performance over traditional ASV systems across various\u0000scenarios, showcasing its potential for real-world deployment. Additionally,\u0000the proposed framework's ability to generalize across different conditions\u0000highlights its robustness and reliability, making it a promising solution for\u0000practical ASV applications.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"58 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
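A simple way to picture the spoofing-aware ASV task is a trial that must pass two checks: the test utterance matches the enrolled speaker, and it is judged bona fide by a countermeasure. The sketch below is a generic baseline-style decision rule under assumed thresholds, not the paper's asymmetric dual-path model or its learned fusion.

```python
# Generic spoofing-aware speaker verification decision (illustration only).
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sasv_decision(enroll_emb, test_emb, cm_bonafide_prob,
                  asv_threshold=0.3, cm_threshold=0.5):
    asv_score = cosine(enroll_emb, test_emb)       # same-speaker evidence
    # accept only if the trial is both the target speaker and judged bona fide
    accept = (asv_score > asv_threshold) and (cm_bonafide_prob > cm_threshold)
    return accept, asv_score

rng = np.random.default_rng(0)
enroll, test = rng.standard_normal(192), rng.standard_normal(192)
print(sasv_decision(enroll, test, cm_bonafide_prob=0.92))
```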
Enhancing Temporal Understanding in Audio Question Answering for Large Audio Language Models
arXiv - EE - Audio and Speech Processing Pub Date : 2024-09-10 DOI: arxiv-2409.06223
Arvind Krishna Sridhar, Yinyi Guo, Erik Visser
{"title":"Enhancing Temporal Understanding in Audio Question Answering for Large Audio Language Models","authors":"Arvind Krishna Sridhar, Yinyi Guo, Erik Visser","doi":"arxiv-2409.06223","DOIUrl":"https://doi.org/arxiv-2409.06223","url":null,"abstract":"The Audio Question Answering task includes audio event classification, audio\u0000captioning, and open ended reasoning. Recently, Audio Question Answering has\u0000garnered attention due to the advent of Large Audio Language Models. Current\u0000literature focuses on constructing LALMs by integrating audio encoders with\u0000text only Large Language Models through a projection module. While Large Audio\u0000Language Models excel in general audio understanding, they are limited in\u0000temporal reasoning which may hinder their commercial applications and on device\u0000deployment. This paper addresses these challenges and limitations in audio\u0000temporal reasoning. First, we introduce a data augmentation technique for\u0000generating reliable audio temporal questions and answers using an LLM. Second,\u0000we propose a continued finetuning curriculum learning strategy to specialize in\u0000temporal reasoning without compromising performance on finetuned tasks.\u0000Finally, we develop a reliable and transparent automated metric, assisted by an\u0000LLM, to measure the correlation between Large Audio Language Model responses\u0000and ground truth data intelligently. We demonstrate the effectiveness of our\u0000proposed techniques using SOTA LALMs on public audio benchmark datasets.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"56 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217896","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
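The abstract describes LLM-generated temporal question-answer pairs; the snippet below is a hypothetical, purely template-based illustration of the same idea, turning timestamped event annotations into ordering and duration questions. It is not the paper's augmentation pipeline, and the event schema is assumed.

```python
# Hypothetical template-based temporal QA generation from event annotations.
events = [
    {"label": "dog bark", "onset": 1.2, "offset": 2.0},
    {"label": "car horn", "onset": 3.5, "offset": 4.1},
]

def temporal_qa_pairs(events):
    events = sorted(events, key=lambda e: e["onset"])
    qa = []
    # ordering questions between consecutive events
    for first, second in zip(events, events[1:]):
        qa.append((f"Which sound occurs first, the {first['label']} or the {second['label']}?",
                   first["label"]))
    # duration questions per event
    for e in events:
        qa.append((f"How long does the {e['label']} last?",
                   f"about {e['offset'] - e['onset']:.1f} seconds"))
    return qa

for q, a in temporal_qa_pairs(events):
    print(q, "->", a)
```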
MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders
arXiv - EE - Audio and Speech Processing Pub Date : 2024-09-10 DOI: arxiv-2409.06635
Wenyu Zhang, Shuo Sun, Bin Wang, Xunlong Zou, Zhuohan Liu, Yingxu He, Geyu Lin, Nancy F. Chen, Ai Ti Aw
{"title":"MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders","authors":"Wenyu Zhang, Shuo Sun, Bin Wang, Xunlong Zou, Zhuohan Liu, Yingxu He, Geyu Lin, Nancy F. Chen, Ai Ti Aw","doi":"arxiv-2409.06635","DOIUrl":"https://doi.org/arxiv-2409.06635","url":null,"abstract":"The rapid advancements in large language models (LLMs) have significantly\u0000enhanced natural language processing capabilities, facilitating the development\u0000of AudioLLMs that process and understand speech and audio inputs alongside\u0000text. Existing AudioLLMs typically combine a pre-trained audio encoder with a\u0000pre-trained LLM, which are subsequently finetuned on specific audio tasks.\u0000However, the pre-trained audio encoder has constrained capacity to capture\u0000features for new tasks and datasets. To address this, we propose to incorporate\u0000mixtures of `weak' encoders (MoWE) into the AudioLLM framework. MoWE\u0000supplements a base encoder with a pool of relatively light weight encoders,\u0000selectively activated based on the audio input to enhance feature extraction\u0000without significantly increasing model size. Our empirical results demonstrate\u0000that MoWE effectively improves multi-task performance, broadening the\u0000applicability of AudioLLMs to more diverse audio tasks.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
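The mixture-of-weak-encoders idea can be sketched as a base encoder whose features are supplemented by a gated pool of lightweight encoders, with a router conditioned on the audio deciding how much each contributes. Layer sizes, the soft (rather than top-k) gating, and the additive fusion below are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch of mixture-of-weak-encoders feature extraction (illustration, not MoWE-Audio's code).
import torch
import torch.nn as nn

class MoWEFeatures(nn.Module):
    """Base encoder plus a gated pool of lightweight ('weak') encoders."""
    def __init__(self, d_in=80, d_model=256, n_weak=4):
        super().__init__()
        self.base = nn.Linear(d_in, d_model)        # stand-in for a large pre-trained encoder
        self.weak = nn.ModuleList(nn.Linear(d_in, d_model) for _ in range(n_weak))
        self.router = nn.Linear(d_in, n_weak)       # decides which weak encoders to emphasize

    def forward(self, feats):
        # feats: (batch, T, d_in), e.g. log-mel frames
        gate = torch.softmax(self.router(feats.mean(dim=1)), dim=-1)        # (batch, n_weak)
        weak_out = torch.stack([enc(feats) for enc in self.weak], dim=1)    # (batch, n_weak, T, d_model)
        mixed = (gate[:, :, None, None] * weak_out).sum(dim=1)              # gated combination
        return self.base(feats) + mixed             # fused features passed on to the LLM

print(MoWEFeatures()(torch.randn(2, 50, 80)).shape)  # torch.Size([2, 50, 256])
```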