MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders

arXiv - EE - Audio and Speech Processing, Pub Date: 2024-09-10, DOI: arxiv-2409.06635

Wenyu Zhang, Shuo Sun, Bin Wang, Xunlong Zou, Zhuohan Liu, Yingxu He, Geyu Lin, Nancy F. Chen, Ai Ti Aw


Abstract

The rapid advancements in large language models (LLMs) have significantly enhanced natural language processing capabilities, facilitating the development of AudioLLMs that process and understand speech and audio inputs alongside text. Existing AudioLLMs typically combine a pre-trained audio encoder with a pre-trained LLM, which are subsequently finetuned on specific audio tasks. However, the pre-trained audio encoder has limited capacity to capture features for new tasks and datasets. To address this, we propose incorporating mixtures of 'weak' encoders (MoWE) into the AudioLLM framework. MoWE supplements a base encoder with a pool of relatively lightweight encoders, which are selectively activated based on the audio input to enhance feature extraction without significantly increasing model size. Our empirical results demonstrate that MoWE effectively improves multi-task performance, broadening the applicability of AudioLLMs to more diverse audio tasks.
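The abstract does not spell out how the weak encoders are selected or how their outputs are merged with the base encoder's, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes a top-k softmax router over the time-averaged input, a weighted sum of the selected weak encoders' outputs, and concatenation with the base features. All names here (`MoWESketch`, `make_encoder`, `route`) and all dimensions are hypothetical, and the random linear maps merely stand in for real pre-trained encoders.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_encoder(in_dim, out_dim):
    """Hypothetical stand-in encoder: one random linear map plus tanh, applied per frame."""
    w = rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)
    return lambda frames: np.tanh(frames @ w)

def softmax(x):
    z = np.exp(x - x.max())  # subtract the max for numerical stability
    return z / z.sum()

class MoWESketch:
    """Base encoder plus a pool of small encoders, of which only top_k run per input."""

    def __init__(self, in_dim, base_dim, weak_dim, n_weak, top_k):
        self.base = make_encoder(in_dim, base_dim)
        self.weak = [make_encoder(in_dim, weak_dim) for _ in range(n_weak)]
        # Assumed router: a linear score per weak encoder from the mean input frame.
        self.router_w = rng.standard_normal((in_dim, n_weak)) / np.sqrt(in_dim)
        self.top_k = top_k

    def route(self, frames):
        logits = frames.mean(axis=0) @ self.router_w
        chosen = np.argsort(logits)[-self.top_k:]   # indices of the top_k scores
        weights = softmax(logits[chosen])           # renormalise over the chosen ones
        return chosen, weights

    def __call__(self, frames):
        chosen, weights = self.route(frames)
        base_feat = self.base(frames)
        # Only the selected weak encoders are evaluated.
        weak_feat = sum(w * self.weak[i](frames) for i, w in zip(chosen, weights))
        # Assumed merge: concatenate base and weak features along the feature axis.
        return np.concatenate([base_feat, weak_feat], axis=-1), chosen

# Usage: 50 frames of 80-dimensional input features (e.g. a log-mel spectrogram).
model = MoWESketch(in_dim=80, base_dim=32, weak_dim=8, n_weak=4, top_k=2)
frames = rng.standard_normal((50, 80))
features, chosen = model(frames)
```

In this sketch the pool adds four small encoders' worth of parameters, but only two of them are evaluated for any given input, which is one way to read "selectively activated based on the audio input".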


