Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU

Zhenyu Ning, Jieru Zhao, Qihao Jin, Wenchao Ding, Minyi Guo
{"title":"Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU","authors":"Zhenyu Ning, Jieru Zhao, Qihao Jin, Wenchao Ding, Minyi Guo","doi":"arxiv-2409.09086","DOIUrl":null,"url":null,"abstract":"Multimodal Large Language Models (MLLMs) are distinguished by their\nmultimodal comprehensive ability and widely used in many real-world\napplications including GPT-4o, autonomous driving and robotics. Despite their\nimpressive performance, the multimodal inputs always incur long context. The\ninference under long context requires caching massive Key and Value states (KV\ncache) of previous tokens, which introduces high latency and excessive memory\nconsumption. Due to this reason, it is challenging to deploy streaming\ninference of MLLMs on edge devices, which largely constrains the power and\nusage of MLLMs in real-world applications. In this paper, we introduce\nInf-MLLM, an efficient inference framework for MLLMs, which enable streaming\ninference of MLLM on a single GPU with infinite context. Inf-MLLM is based on\nour key observation of the attention pattern in both LLMs and MLLMs called\n\"attention saddles\". Thanks to the newly discovered attention pattern, Inf-MLLM\nmaintains a size-constrained KV cache by dynamically caching recent tokens and\nrelevant tokens. Furthermore, Inf-MLLM proposes attention bias, a novel\napproach to enable MLLMs to capture long-term dependency. We show that Inf-MLLM\nenables multiple LLMs and MLLMs to achieve stable performance over 4M-token\nlong texts and multi-round conversations with 1-hour-long videos on a single\nGPU. In addition, Inf-MLLM exhibits superior streaming reasoning quality than\nexisting methods such as StreamingLLM and 2x speedup than H2O.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"42 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Performance","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.09086","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Multimodal Large Language Models (MLLMs) are distinguished by their comprehensive multimodal abilities and are widely used in many real-world applications, including GPT-4o, autonomous driving, and robotics. Despite their impressive performance, multimodal inputs always incur long contexts. Inference under long contexts requires caching massive Key and Value states (KV cache) of previous tokens, which introduces high latency and excessive memory consumption. For this reason, it is challenging to deploy streaming inference of MLLMs on edge devices, which largely constrains the power and usage of MLLMs in real-world applications. In this paper, we introduce Inf-MLLM, an efficient inference framework for MLLMs that enables streaming inference of MLLMs on a single GPU with infinite context. Inf-MLLM is based on our key observation of an attention pattern in both LLMs and MLLMs, which we call "attention saddles". Thanks to this newly discovered attention pattern, Inf-MLLM maintains a size-constrained KV cache by dynamically caching recent tokens and relevant tokens. Furthermore, Inf-MLLM proposes attention bias, a novel approach that enables MLLMs to capture long-term dependencies. We show that Inf-MLLM enables multiple LLMs and MLLMs to achieve stable performance over 4M-token-long texts and multi-round conversations with 1-hour-long videos on a single GPU. In addition, Inf-MLLM exhibits superior streaming reasoning quality compared to existing methods such as StreamingLLM and a 2x speedup over H2O.
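The abstract describes the core mechanism only at a high level: a size-constrained KV cache that retains recent tokens plus "relevant" tokens, with an attention bias to preserve long-term dependencies. The following is a minimal sketch of that idea under our own assumptions, not the authors' implementation; the function name `evict_kv_cache` and the parameters `budget`, `recent_window`, and `attention_bias` are hypothetical, and the exact scoring used in Inf-MLLM (based on its "attention saddles" observation) is not specified in the abstract.

```python
# Hedged sketch of a size-constrained KV cache: keep (a) the most recent tokens
# and (b) the older tokens with the highest accumulated attention scores, where
# a position-dependent bias counteracts the tendency of attention mass to
# concentrate on recent tokens. All parameter names are illustrative.
import numpy as np

def evict_kv_cache(keys, values, attn_scores, budget, recent_window, attention_bias=0.0):
    """Return a KV cache of at most `budget` tokens.

    keys, values : arrays of shape (seq_len, head_dim)
    attn_scores  : per-token attention mass accumulated over recent decoding
                   steps, shape (seq_len,)
    """
    seq_len = keys.shape[0]
    if seq_len <= budget:
        return keys, values, np.arange(seq_len)

    # Always keep the most recent tokens (local context).
    recent_idx = np.arange(seq_len - recent_window, seq_len)

    # Score older tokens; the bias rewards earlier positions so that
    # distant-but-important tokens are not evicted purely due to recency.
    older_idx = np.arange(seq_len - recent_window)
    biased = attn_scores[older_idx] + attention_bias * (1.0 - older_idx / seq_len)

    # Fill the remaining budget with the highest-scoring older tokens.
    n_keep = budget - recent_window
    relevant_idx = older_idx[np.argsort(biased)[-n_keep:]]

    keep = np.sort(np.concatenate([relevant_idx, recent_idx]))
    return keys[keep], values[keep], keep

# Toy usage: prune a 64-token cache to 16 entries (8 recent + 8 relevant).
rng = np.random.default_rng(0)
k = rng.standard_normal((64, 128))
v = rng.standard_normal((64, 128))
scores = rng.random(64)
k2, v2, kept = evict_kv_cache(k, v, scores, budget=16, recent_window=8, attention_bias=0.5)
print(kept)
```

In a streaming setting this eviction would run whenever the cache exceeds its budget, so memory stays bounded regardless of how long the text or video conversation grows.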