Ideal-LLM: Integrating Dual Encoders and Language-Adapted LLM for Multilingual Speech-to-Text
Hongfei Xue, Wei Ren, Xuelong Geng, Kun Wei, Longhao Li, Qijie Shao, Linju Yang, Kai Diao, Lei Xie
arXiv:2409.11214 (arXiv - EE - Audio and Speech Processing), published 2024-09-17
Citations: 0
Abstract
Integrating audio encoders with LLMs through connectors has enabled these models to process and comprehend audio modalities, significantly enhancing speech-to-text tasks, including automatic speech recognition (ASR) and automatic speech translation (AST). However, these methods often overlook the critical aspect of language adaptation in multilingual settings, relying instead on multilingual data without adequately addressing language differences. To address this gap, we propose the Ideal-LLM model, which employs dual multilingual encoders to enrich language feature information and a language-adapted connector that targets the adaptation of each language specifically. By leveraging the complementary strengths of the Whisper and MMS encoders, our approach ensures richer multilingual representations. Additionally, the language-adapted connector enhances modality transformation via a language weight selector tailored to each language. Experimental results demonstrate that Ideal-LLM significantly improves ASR performance, achieving a 32.6% relative reduction in average word error rate compared to a standard speech encoder integrated with an LLM, and yields an average BLEU score of 36.78 on the AST task.
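
The dual-encoder fusion described in the abstract can be illustrated with a short sketch. The PyTorch module below is a minimal illustration under stated assumptions, not the authors' implementation: the class and parameter names are hypothetical, the feature dimensions are placeholders, both encoder streams are assumed to be aligned to the same frame rate, and a two-way softmax gate stands in for the language weight selector.

```python
# Hypothetical sketch of a language-adapted connector fusing two encoder
# streams (e.g., Whisper and MMS) before an LLM. All names, dimensions, and
# the gating design are assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn


class LanguageAdaptedConnector(nn.Module):
    """Mixes two encoder outputs with per-language weights, then projects
    the fused features into the LLM embedding space."""

    def __init__(self, whisper_dim: int, mms_dim: int,
                 llm_dim: int, num_langs: int):
        super().__init__()
        # Project each encoder's features into a shared space before mixing.
        self.whisper_proj = nn.Linear(whisper_dim, llm_dim)
        self.mms_proj = nn.Linear(mms_dim, llm_dim)
        # One learnable logit pair per language: the "language weight
        # selector" choosing how much each encoder contributes.
        self.lang_logits = nn.Parameter(torch.zeros(num_langs, 2))

    def forward(self, whisper_feats: torch.Tensor,
                mms_feats: torch.Tensor,
                lang_id: torch.Tensor) -> torch.Tensor:
        # whisper_feats: (B, T, whisper_dim); mms_feats: (B, T, mms_dim)
        # lang_id: (B,) LongTensor of language indices.
        # Assumes both streams were already resampled to the same length T.
        w = torch.softmax(self.lang_logits[lang_id], dim=-1)  # (B, 2)
        h_w = self.whisper_proj(whisper_feats)                # (B, T, llm_dim)
        h_m = self.mms_proj(mms_feats)                        # (B, T, llm_dim)
        # Broadcast the per-language weights over time and feature dims.
        fused = w[:, 0, None, None] * h_w + w[:, 1, None, None] * h_m
        return fused  # fed to the LLM as continuous (soft-prompt) inputs
```

Keeping one logit pair per language lets the model learn, for each language, how much to weight Whisper's supervised features against MMS's self-supervised ones, which matches the intuition of a per-language weight selector described in the abstract.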