MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders

Wenyu Zhang, Shuo Sun, Bin Wang, Xunlong Zou, Zhuohan Liu, Yingxu He, Geyu Lin, Nancy F. Chen, Ai Ti Aw
{"title":"MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders","authors":"Wenyu Zhang, Shuo Sun, Bin Wang, Xunlong Zou, Zhuohan Liu, Yingxu He, Geyu Lin, Nancy F. Chen, Ai Ti Aw","doi":"arxiv-2409.06635","DOIUrl":null,"url":null,"abstract":"The rapid advancements in large language models (LLMs) have significantly\nenhanced natural language processing capabilities, facilitating the development\nof AudioLLMs that process and understand speech and audio inputs alongside\ntext. Existing AudioLLMs typically combine a pre-trained audio encoder with a\npre-trained LLM, which are subsequently finetuned on specific audio tasks.\nHowever, the pre-trained audio encoder has constrained capacity to capture\nfeatures for new tasks and datasets. To address this, we propose to incorporate\nmixtures of `weak' encoders (MoWE) into the AudioLLM framework. MoWE\nsupplements a base encoder with a pool of relatively light weight encoders,\nselectively activated based on the audio input to enhance feature extraction\nwithout significantly increasing model size. Our empirical results demonstrate\nthat MoWE effectively improves multi-task performance, broadening the\napplicability of AudioLLMs to more diverse audio tasks.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.06635","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

The rapid advancements in large language models (LLMs) have significantly enhanced natural language processing capabilities, facilitating the development of AudioLLMs that process and understand speech and audio inputs alongside text. Existing AudioLLMs typically combine a pre-trained audio encoder with a pre-trained LLM, which are subsequently finetuned on specific audio tasks. However, the pre-trained audio encoder has constrained capacity to capture features for new tasks and datasets. To address this, we propose to incorporate mixtures of `weak' encoders (MoWE) into the AudioLLM framework. MoWE supplements a base encoder with a pool of relatively light weight encoders, selectively activated based on the audio input to enhance feature extraction without significantly increasing model size. Our empirical results demonstrate that MoWE effectively improves multi-task performance, broadening the applicability of AudioLLMs to more diverse audio tasks.
MoWE-Audio:使用弱编码器混合的多任务音频LLMs
大型语言模型(LLM)的快速发展极大地增强了自然语言处理能力,促进了音频LLM 的发展,音频LLM 可以处理和理解语音和音频输入以及文本。现有的音频LLM 通常将预先训练好的音频编码器与预先训练好的 LLM 结合在一起,然后在特定的音频任务中对其进行微调。然而,预先训练好的音频编码器捕捉新任务和数据集特征的能力受到限制。为了解决这个问题,我们建议将 "弱 "编码器混合物(MoWE)纳入音频LLM 框架。MoWE 使用相对较轻的编码器池对基本编码器进行补充,并根据音频输入有选择性地激活,从而在不显著增加模型大小的情况下增强特征提取。我们的实证结果表明,MoWE 有效地提高了多任务性能,扩大了 AudioLLM 在更多样化音频任务中的应用范围。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信