Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models

arXiv - EE - Audio and Speech Processing. Pub Date: 2024-09-17. DOI: arxiv-2409.10999

Potsawee Manakul, Guangzhi Sun, Warit Sirichotedumrong, Kasima Tharnpipitchai, Kunat Pipatanakul

Abstract

Audio language models can understand audio inputs and perform a range of audio-related tasks, such as speech recognition and audio captioning, based on instructions that are usually textual prompts. These models are mostly initialized from pre-trained audio encoders and large language models (LLMs). Although the pre-trained components were developed to support multiple languages, audio language models are trained predominantly on English data, which may limit their usability to English instructions or English speech inputs. First, this paper examines the performance of existing audio language models in an underserved language, using Thai as an example. It demonstrates that, despite being built on multilingual backbones, audio language models do not exhibit emergent cross-lingual abilities in low-resource languages. Second, this paper studies data mixtures for developing audio language models that are optimized for a target language as well as English. In addition, this paper integrates audio comprehension and speech instruction-following capabilities into a single unified model. Our experiments provide insights into data mixtures for enhancing instruction-following capabilities in both a low-resource language and English. Our model, Typhoon-Audio, outperforms existing open-source audio language models by a considerable margin and is comparable to the state-of-the-art Gemini-1.5-Pro in both English and Thai.
