MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders
Wenyu Zhang, Shuo Sun, Bin Wang, Xunlong Zou, Zhuohan Liu, Yingxu He, Geyu Lin, Nancy F. Chen, Ai Ti Aw
arXiv - EE - Audio and Speech Processing, 2024-09-10. DOI: arxiv-2409.06635
Abstract
The rapid advancements in large language models (LLMs) have significantly
enhanced natural language processing capabilities, facilitating the development
of AudioLLMs that process and understand speech and audio inputs alongside
text. Existing AudioLLMs typically combine a pre-trained audio encoder with a
pre-trained LLM, which are subsequently fine-tuned on specific audio tasks.
However, the pre-trained audio encoder has limited capacity to capture
features for new tasks and datasets. To address this, we propose incorporating
mixtures of 'weak' encoders (MoWE) into the AudioLLM framework. MoWE
supplements a base encoder with a pool of relatively lightweight encoders,
selectively activated based on the audio input to enhance feature extraction
without significantly increasing model size. Our empirical results demonstrate
that MoWE effectively improves multi-task performance, broadening the
applicability of AudioLLMs to more diverse audio tasks.
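
The selective activation described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how encoders are selected or how their outputs are merged, so everything here is an assumption made for illustration: softmax routing, top-k selection, renormalized gate weights, concatenation of base and weak features, and all names (`mowe_features`, `make_linear`, `router`). Toy linear maps stand in for real audio encoders.

```python
import math
import random


def make_linear(in_dim, out_dim, rng):
    """Return a toy linear map standing in for an encoder or a router."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

    def apply(x):
        return [sum(w * v for w, v in zip(row, x)) for row in weights]

    return apply


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mowe_features(x, base_encoder, weak_encoders, router, top_k=2):
    """Concatenate base features with a gated sum of the top-k weak encoders.

    Assumptions (not taken from the abstract): softmax routing, top-k
    selection, renormalized gate weights, and feature concatenation.
    """
    gates = softmax(router(x))
    # Indices of the k weak encoders with the largest gate values.
    active = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:top_k]
    norm = sum(gates[i] for i in active)

    weak_out = None
    for i in active:
        feats = weak_encoders[i](x)  # only the selected encoders are evaluated
        scale = gates[i] / norm
        if weak_out is None:
            weak_out = [scale * f for f in feats]
        else:
            weak_out = [acc + scale * f for acc, f in zip(weak_out, feats)]

    # List concatenation: base features followed by the mixed weak features.
    return base_encoder(x) + weak_out, active


rng = random.Random(0)
IN_DIM, BASE_DIM, WEAK_DIM, NUM_WEAK = 8, 6, 3, 4
base_encoder = make_linear(IN_DIM, BASE_DIM, rng)
weak_encoders = [make_linear(IN_DIM, WEAK_DIM, rng) for _ in range(NUM_WEAK)]
router = make_linear(IN_DIM, NUM_WEAK, rng)

audio_frame = [rng.uniform(-1.0, 1.0) for _ in range(IN_DIM)]
features, active = mowe_features(audio_frame, base_encoder, weak_encoders, router)
print(len(features), active)
```

In this sketch only `top_k` of the weak encoders run per input, so the added compute scales with `top_k` rather than with the size of the pool, which is one way to read the abstract's point about not significantly increasing model size.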