Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, Yang Feng
{"title":"LLaMA-Omni: Seamless Speech Interaction with Large Language Models","authors":"Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, Yang Feng","doi":"arxiv-2409.06666","DOIUrl":null,"url":null,"abstract":"Models like GPT-4o enable real-time interaction with large language models\n(LLMs) through speech, significantly enhancing user experience compared to\ntraditional text-based interaction. However, there is still a lack of\nexploration on how to build speech interaction models based on open-source\nLLMs. To address this, we propose LLaMA-Omni, a novel model architecture\ndesigned for low-latency and high-quality speech interaction with LLMs.\nLLaMA-Omni integrates a pretrained speech encoder, a speech adaptor, an LLM,\nand a streaming speech decoder. It eliminates the need for speech\ntranscription, and can simultaneously generate text and speech responses\ndirectly from speech instructions with extremely low latency. We build our\nmodel based on the latest Llama-3.1-8B-Instruct model. To align the model with\nspeech interaction scenarios, we construct a dataset named InstructS2S-200K,\nwhich includes 200K speech instructions and corresponding speech responses.\nExperimental results show that compared to previous speech-language models,\nLLaMA-Omni provides better responses in both content and style, with a response\nlatency as low as 226ms. Additionally, training LLaMA-Omni takes less than 3\ndays on just 4 GPUs, paving the way for the efficient development of\nspeech-language models in the future.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"59 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.06666","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Models like GPT-4o enable real-time speech interaction with large language models (LLMs), significantly enhancing the user experience compared to traditional text-based interaction. However, how to build speech interaction models on top of open-source LLMs remains largely unexplored. To address this, we propose LLaMA-Omni, a novel model architecture designed for low-latency, high-quality speech interaction with LLMs. LLaMA-Omni integrates a pretrained speech encoder, a speech adaptor, an LLM, and a streaming speech decoder. It eliminates the need for intermediate speech transcription and can simultaneously generate text and speech responses directly from speech instructions with extremely low latency. We build our model on the latest Llama-3.1-8B-Instruct model. To align the model with speech interaction scenarios, we construct InstructS2S-200K, a dataset of 200K speech instructions and corresponding speech responses. Experimental results show that, compared to previous speech-language models, LLaMA-Omni provides better responses in both content and style, with a response latency as low as 226 ms. Additionally, training LLaMA-Omni takes less than 3 days on just 4 GPUs, paving the way for the efficient development of future speech-language models.
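To make the four-component pipeline described in the abstract (speech encoder → speech adaptor → LLM → streaming speech decoder) concrete, here is a minimal, toy-sized PyTorch sketch of the data flow. Every module, name, and dimension below is an illustrative assumption, not the authors' implementation: the encoder and LLM are small stand-in modules, the adaptor downsamples and projects encoder features into the LLM embedding space, and the unit decoder non-autoregressively predicts discrete speech units that a vocoder could turn into a waveform.

```python
# Toy-sized architectural sketch: speech encoder -> adaptor -> LLM -> streaming unit decoder.
# Every module below is a placeholder; dimensions are deliberately small.
import torch
import torch.nn as nn


class SpeechAdaptor(nn.Module):
    """Downsamples speech-encoder features and projects them into the LLM embedding space."""

    def __init__(self, enc_dim=256, llm_dim=512, stride=5):
        super().__init__()
        self.stride = stride
        self.proj = nn.Sequential(
            nn.Linear(enc_dim * stride, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats):                     # feats: (B, T, enc_dim)
        B, T, D = feats.shape
        T = T - T % self.stride                   # drop the ragged tail
        feats = feats[:, :T].reshape(B, T // self.stride, D * self.stride)
        return self.proj(feats)                   # (B, T // stride, llm_dim)


class StreamingUnitDecoder(nn.Module):
    """Non-autoregressively predicts discrete speech units from LLM hidden states."""

    def __init__(self, llm_dim=512, n_units=1000, upsample=4):
        super().__init__()
        self.upsample = upsample
        layer = nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True)
        self.layers = nn.TransformerEncoder(layer, num_layers=2)
        self.unit_head = nn.Linear(llm_dim, n_units)

    def forward(self, hidden):                    # hidden: (B, L, llm_dim)
        hidden = hidden.repeat_interleave(self.upsample, dim=1)
        return self.unit_head(self.layers(hidden))  # (B, L * upsample, n_units)


# Stand-ins for the pretrained components (e.g. a Whisper-style encoder and an instruct LLM).
enc_dim, llm_dim, text_vocab = 256, 512, 32000
speech_encoder = nn.Linear(80, enc_dim)           # placeholder for a pretrained speech encoder
llm_layer = nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True)
llm_backbone = nn.TransformerEncoder(llm_layer, num_layers=2)  # placeholder for the LLM
text_head = nn.Linear(llm_dim, text_vocab)
adaptor = SpeechAdaptor(enc_dim, llm_dim)
unit_decoder = StreamingUnitDecoder(llm_dim)

# One forward pass: speech features in, text logits and speech-unit logits out together,
# with no intermediate transcription step.
mel = torch.randn(1, 100, 80)                     # (B, frames, mel bins)
speech_feats = speech_encoder(mel)                # (1, 100, enc_dim)
speech_embeds = adaptor(speech_feats)             # (1, 20, llm_dim), fed to the LLM directly
hidden = llm_backbone(speech_embeds)              # LLM hidden states for the response
text_logits = text_head(hidden)                   # text response logits
unit_logits = unit_decoder(hidden)                # discrete unit logits for a vocoder
print(text_logits.shape, unit_logits.shape)
```

The point of the sketch is the data flow: the LLM consumes adapted speech embeddings directly (no transcription step), and the streaming decoder maps the same hidden states to discrete speech units, which is what allows text and speech responses to be produced simultaneously. The real system's choice of encoder, LLM, and vocoder is described in the paper itself.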