LLaMA-Omni: Seamless Speech Interaction with Large Language Models

Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, Yang Feng
{"title":"LLaMA-Omni: Seamless Speech Interaction with Large Language Models","authors":"Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, Yang Feng","doi":"arxiv-2409.06666","DOIUrl":null,"url":null,"abstract":"Models like GPT-4o enable real-time interaction with large language models\n(LLMs) through speech, significantly enhancing user experience compared to\ntraditional text-based interaction. However, there is still a lack of\nexploration on how to build speech interaction models based on open-source\nLLMs. To address this, we propose LLaMA-Omni, a novel model architecture\ndesigned for low-latency and high-quality speech interaction with LLMs.\nLLaMA-Omni integrates a pretrained speech encoder, a speech adaptor, an LLM,\nand a streaming speech decoder. It eliminates the need for speech\ntranscription, and can simultaneously generate text and speech responses\ndirectly from speech instructions with extremely low latency. We build our\nmodel based on the latest Llama-3.1-8B-Instruct model. To align the model with\nspeech interaction scenarios, we construct a dataset named InstructS2S-200K,\nwhich includes 200K speech instructions and corresponding speech responses.\nExperimental results show that compared to previous speech-language models,\nLLaMA-Omni provides better responses in both content and style, with a response\nlatency as low as 226ms. Additionally, training LLaMA-Omni takes less than 3\ndays on just 4 GPUs, paving the way for the efficient development of\nspeech-language models in the future.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"59 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.06666","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Models like GPT-4o enable real-time speech interaction with large language models (LLMs), significantly enhancing the user experience compared to traditional text-based interaction. However, how to build speech interaction models on top of open-source LLMs remains largely unexplored. To address this, we propose LLaMA-Omni, a novel model architecture designed for low-latency, high-quality speech interaction with LLMs. LLaMA-Omni integrates a pretrained speech encoder, a speech adaptor, an LLM, and a streaming speech decoder. It eliminates the need for speech transcription and can simultaneously generate text and speech responses directly from speech instructions with extremely low latency. We build our model on the latest Llama-3.1-8B-Instruct model. To align the model with speech interaction scenarios, we construct a dataset named InstructS2S-200K, which includes 200K speech instructions and corresponding speech responses. Experimental results show that, compared to previous speech-language models, LLaMA-Omni provides better responses in both content and style, with a response latency as low as 226 ms. Additionally, training LLaMA-Omni takes less than 3 days on just 4 GPUs, paving the way for the efficient development of speech-language models in the future.
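The abstract outlines a four-stage pipeline: a pretrained speech encoder, a speech adaptor that maps acoustic features into the LLM's embedding space, the LLM itself, and a streaming speech decoder that emits speech alongside the text response. Below is a minimal PyTorch sketch of that data flow, not the authors' implementation: the layer sizes, the 5x frame-downsampling adaptor, the number of discrete speech units, and the CTC-style blank are all illustrative assumptions, and the encoder and LLM stand-ins are toy modules rather than the actual pretrained speech encoder or Llama-3.1-8B-Instruct weights.

```python
# Minimal sketch of a LLaMA-Omni-style pipeline, assuming the four stages
# named in the abstract: speech encoder -> speech adaptor -> LLM -> streaming
# speech decoder. All dimensions and hyperparameters below are illustrative.
import torch
import torch.nn as nn


class SpeechAdaptor(nn.Module):
    """Downsamples encoder frames and projects them into the LLM embedding space."""

    def __init__(self, enc_dim: int, llm_dim: int, k: int = 5):
        super().__init__()
        self.k = k  # assumed downsampling factor: concatenate k adjacent frames
        self.proj = nn.Sequential(
            nn.Linear(enc_dim * k, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        t = t - t % self.k                       # drop frames that don't fill a group
        x = x[:, :t].reshape(b, t // self.k, d * self.k)
        return self.proj(x)


class StreamingSpeechDecoder(nn.Module):
    """Non-autoregressive decoder mapping LLM hidden states to discrete speech units."""

    def __init__(self, llm_dim: int, n_units: int = 1000, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.unit_head = nn.Linear(llm_dim, n_units + 1)  # +1 for a CTC-style blank

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.unit_head(self.blocks(h))    # (batch, time, n_units + 1) logits


# Toy end-to-end forward pass with stand-ins for the pretrained components.
enc_dim, llm_dim = 1280, 4096                    # assumed encoder / LLM hidden sizes
speech_encoder = nn.Linear(128, enc_dim)         # stand-in for a frozen speech encoder
adaptor = SpeechAdaptor(enc_dim, llm_dim)
llm_body = nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True)  # stand-in LLM
speech_decoder = StreamingSpeechDecoder(llm_dim)

audio_feats = torch.randn(1, 100, 128)           # dummy mel-like input features
h = llm_body(adaptor(speech_encoder(audio_feats)))
unit_logits = speech_decoder(h)                  # units -> vocoder -> waveform (not shown)
print(unit_logits.shape)                         # torch.Size([1, 20, 1001])
```

Running the toy forward pass yields unit logits of shape (1, 20, 1001). In a real system of this shape, the predicted units would be vocoded into a waveform as they stream out of the decoder, which is what allows the text and speech responses to be produced simultaneously rather than speech being synthesized after the full text is complete.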