{"title":"Inf-MLLM:在单个 GPU 上实现多模态大型语言模型的高效流推理","authors":"Zhenyu Ning, Jieru Zhao, Qihao Jin, Wenchao Ding, Minyi Guo","doi":"arxiv-2409.09086","DOIUrl":null,"url":null,"abstract":"Multimodal Large Language Models (MLLMs) are distinguished by their\nmultimodal comprehensive ability and widely used in many real-world\napplications including GPT-4o, autonomous driving and robotics. Despite their\nimpressive performance, the multimodal inputs always incur long context. The\ninference under long context requires caching massive Key and Value states (KV\ncache) of previous tokens, which introduces high latency and excessive memory\nconsumption. Due to this reason, it is challenging to deploy streaming\ninference of MLLMs on edge devices, which largely constrains the power and\nusage of MLLMs in real-world applications. In this paper, we introduce\nInf-MLLM, an efficient inference framework for MLLMs, which enable streaming\ninference of MLLM on a single GPU with infinite context. Inf-MLLM is based on\nour key observation of the attention pattern in both LLMs and MLLMs called\n\"attention saddles\". Thanks to the newly discovered attention pattern, Inf-MLLM\nmaintains a size-constrained KV cache by dynamically caching recent tokens and\nrelevant tokens. Furthermore, Inf-MLLM proposes attention bias, a novel\napproach to enable MLLMs to capture long-term dependency. We show that Inf-MLLM\nenables multiple LLMs and MLLMs to achieve stable performance over 4M-token\nlong texts and multi-round conversations with 1-hour-long videos on a single\nGPU. In addition, Inf-MLLM exhibits superior streaming reasoning quality than\nexisting methods such as StreamingLLM and 2x speedup than H2O.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"42 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU\",\"authors\":\"Zhenyu Ning, Jieru Zhao, Qihao Jin, Wenchao Ding, Minyi Guo\",\"doi\":\"arxiv-2409.09086\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Multimodal Large Language Models (MLLMs) are distinguished by their\\nmultimodal comprehensive ability and widely used in many real-world\\napplications including GPT-4o, autonomous driving and robotics. Despite their\\nimpressive performance, the multimodal inputs always incur long context. The\\ninference under long context requires caching massive Key and Value states (KV\\ncache) of previous tokens, which introduces high latency and excessive memory\\nconsumption. Due to this reason, it is challenging to deploy streaming\\ninference of MLLMs on edge devices, which largely constrains the power and\\nusage of MLLMs in real-world applications. In this paper, we introduce\\nInf-MLLM, an efficient inference framework for MLLMs, which enable streaming\\ninference of MLLM on a single GPU with infinite context. Inf-MLLM is based on\\nour key observation of the attention pattern in both LLMs and MLLMs called\\n\\\"attention saddles\\\". Thanks to the newly discovered attention pattern, Inf-MLLM\\nmaintains a size-constrained KV cache by dynamically caching recent tokens and\\nrelevant tokens. Furthermore, Inf-MLLM proposes attention bias, a novel\\napproach to enable MLLMs to capture long-term dependency. 
We show that Inf-MLLM\\nenables multiple LLMs and MLLMs to achieve stable performance over 4M-token\\nlong texts and multi-round conversations with 1-hour-long videos on a single\\nGPU. In addition, Inf-MLLM exhibits superior streaming reasoning quality than\\nexisting methods such as StreamingLLM and 2x speedup than H2O.\",\"PeriodicalId\":501291,\"journal\":{\"name\":\"arXiv - CS - Performance\",\"volume\":\"42 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Performance\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.09086\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Performance","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.09086","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU
Multimodal Large Language Models (MLLMs) are distinguished by their multimodal comprehension ability and are widely used in many real-world applications, including GPT-4o, autonomous driving, and robotics. Despite their impressive performance, multimodal inputs often produce long contexts. Inference over long contexts requires caching the Key and Value states (KV cache) of a massive number of previous tokens, which introduces high latency and excessive memory consumption. For this reason, deploying streaming inference of MLLMs on edge devices is challenging, which largely constrains the capability and usage of MLLMs in real-world applications.
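To make the memory pressure concrete, the back-of-the-envelope sketch below estimates KV cache size as a function of context length. The layer, head, and dtype settings are illustrative assumptions for a typical 7B-class decoder, not figures from the paper.

```python
# Rough KV cache size estimate; model dimensions are illustrative assumptions
# (a typical 7B-class decoder in fp16), not figures reported in the paper.

def kv_cache_bytes(num_tokens: int,
                   num_layers: int = 32,
                   num_heads: int = 32,
                   head_dim: int = 128,
                   bytes_per_value: int = 2) -> int:
    """KV cache size: 2 (K and V) x layers x tokens x hidden dim x bytes per element."""
    hidden_dim = num_heads * head_dim
    return 2 * num_layers * num_tokens * hidden_dim * bytes_per_value

for n in (8_000, 100_000, 1_000_000, 4_000_000):
    print(f"{n:>9} tokens -> {kv_cache_bytes(n) / 2**30:8.1f} GiB")
# At 4M tokens the full KV cache would approach 2 TiB, far beyond a single GPU's memory.
```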
In this paper, we introduce Inf-MLLM, an efficient inference framework that enables streaming inference of MLLMs with infinite context on a single GPU. Inf-MLLM is based on our key observation of an attention pattern shared by LLMs and MLLMs, which we call "attention saddles". Leveraging this newly discovered pattern, Inf-MLLM maintains a size-constrained KV cache by dynamically caching recent tokens and relevant tokens. Furthermore, Inf-MLLM proposes attention bias, a novel approach that enables MLLMs to capture long-term dependencies.
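As a rough illustration of how a size-constrained KV cache can retain both recent and relevant tokens, the sketch below keeps a fixed recency window plus the tokens that have received the highest accumulated attention. The function name, tensor layout, and scoring heuristic are assumptions for illustration; this is a simplified stand-in, not the paper's attention-saddle selection or its attention-bias mechanism.

```python
import torch

def evict_kv(keys, values, attn_scores, budget: int, recent: int):
    """Keep a size-constrained KV cache: the `recent` most recent tokens plus the
    older tokens with the highest accumulated attention, up to `budget` entries.
    Simplified stand-in for illustration, not Inf-MLLM's exact algorithm.

    keys, values: [seq_len, num_heads, head_dim]
    attn_scores:  [seq_len] accumulated attention received by each cached token
    """
    assert budget > recent
    seq_len = keys.shape[0]
    if seq_len <= budget:
        return keys, values, attn_scores

    recent_idx = torch.arange(seq_len - recent, seq_len)   # always keep the recency window
    older_scores = attn_scores[: seq_len - recent]          # candidates for "relevant" tokens
    relevant_idx = torch.topk(older_scores, k=budget - recent).indices
    keep = torch.cat([torch.sort(relevant_idx).values, recent_idx])

    return keys[keep], values[keep], attn_scores[keep]

# Example: keep at most 512 of 2048 cached tokens (64 most recent + 448 "relevant").
k = torch.randn(2048, 32, 128); v = torch.randn_like(k); s = torch.rand(2048)
k2, v2, s2 = evict_kv(k, v, s, budget=512, recent=64)
```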
We show that Inf-MLLM enables multiple LLMs and MLLMs to achieve stable performance on texts up to 4M tokens long and in multi-round conversations over 1-hour-long videos, all on a single GPU. In addition, Inf-MLLM delivers superior streaming reasoning quality compared to existing methods such as StreamingLLM, as well as a 2x speedup over H2O.