MinCache: A hybrid cache system for efficient chatbots with hierarchical embedding matching and LLM

IF 6.2 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, THEORY & METHODS
Keihan Haqiq, Majid Vafaei Jahan, Saeede Anbaee Farimani, Seyed Mahmood Fattahi Masoom
{"title":"MinCache: A hybrid cache system for efficient chatbots with hierarchical embedding matching and LLM","authors":"Keihan Haqiq ,&nbsp;Majid Vafaei Jahan ,&nbsp;Saeede Anbaee Farimani ,&nbsp;Seyed Mahmood Fattahi Masoom","doi":"10.1016/j.future.2025.107822","DOIUrl":null,"url":null,"abstract":"<div><div>Large Language Models (LLMs) have emerged as powerful tools for various natural language processing tasks such as multi-agent chatbots, but their computational complexity and resource requirements pose significant challenges for real-time chatbot applications. Caching strategies can alleviate these challenges by reducing redundant computations and improving response times. In this paper, we propose MinCache, a novel hybrid caching system tailored for LLM applications. Our system employs a hierarchical cache strategy for string retrieval, performing exact match lookups first, followed by resemblance matching, and finally resorting to semantic matching to deliver the most relevant information. MinCache combines the strengths of Least Recently Used (LRU) cache and string fingerprints caching techniques, leveraging MinHash algorithm for fast the <em>resemblance</em> matching. Additionally, Mincache leverage a sentence-transformer for estimating <em>semantics</em> of input prompts. By integrating these approaches, MinCache delivers high cache hit rates, faster response delivery, and improved scalability for LLM applications across diverse domains. Our experiments demonstrate a significant acceleration of LLM applications by up to <span>4.5X</span> against GPTCache as well as improvements in accurate cache hit rate. We also discuss the scalability of our proposed approach across medical domain chat services.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"170 ","pages":"Article 107822"},"PeriodicalIF":6.2000,"publicationDate":"2025-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Future Generation Computer Systems-The International Journal of Escience","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167739X25001177","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
Citations: 0

Abstract

Large Language Models (LLMs) have emerged as powerful tools for various natural language processing tasks such as multi-agent chatbots, but their computational complexity and resource requirements pose significant challenges for real-time chatbot applications. Caching strategies can alleviate these challenges by reducing redundant computation and improving response times. In this paper, we propose MinCache, a novel hybrid caching system tailored for LLM applications. Our system employs a hierarchical cache strategy for string retrieval, performing exact-match lookups first, followed by resemblance matching, and finally resorting to semantic matching to deliver the most relevant information. MinCache combines the strengths of a Least Recently Used (LRU) cache and string-fingerprint caching, leveraging the MinHash algorithm for fast resemblance matching. Additionally, MinCache leverages a sentence transformer to estimate the semantics of input prompts. By integrating these approaches, MinCache delivers high cache hit rates, faster response delivery, and improved scalability for LLM applications across diverse domains. Our experiments demonstrate a significant acceleration of LLM applications, up to 4.5X over GPTCache, as well as improvements in cache hit accuracy. We also discuss the scalability of our proposed approach for medical-domain chat services.
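The three-tier lookup the abstract describes (exact match, then MinHash resemblance, then embedding-based semantics, with LRU eviction) can be sketched compactly. The sketch below is illustrative only, not the authors' implementation: the class name `MinCacheSketch`, the `datasketch` and `sentence-transformers` libraries, the `all-MiniLM-L6-v2` encoder, and the similarity thresholds are all assumptions made for the example.

```python
# Minimal sketch of a hierarchical prompt cache in the spirit of MinCache.
# Assumptions: datasketch for MinHash, sentence-transformers for embeddings,
# and illustrative thresholds; none of these are specified by the paper.
from collections import OrderedDict
from datasketch import MinHash
from sentence_transformers import SentenceTransformer, util

class MinCacheSketch:
    def __init__(self, capacity=1024, num_perm=128,
                 jaccard_threshold=0.8, cosine_threshold=0.85):
        self.capacity = capacity
        self.num_perm = num_perm
        self.jaccard_threshold = jaccard_threshold
        self.cosine_threshold = cosine_threshold
        # prompt -> (response, minhash fingerprint, sentence embedding)
        self.store = OrderedDict()
        self.encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model

    def _fingerprint(self, text):
        m = MinHash(num_perm=self.num_perm)
        for token in text.lower().split():
            m.update(token.encode("utf8"))
        return m

    def get(self, prompt):
        # Tier 1: exact match (O(1) dict lookup, refresh LRU position).
        if prompt in self.store:
            self.store.move_to_end(prompt)
            return self.store[prompt][0]
        # Tier 2: resemblance match via MinHash-estimated Jaccard similarity.
        fp = self._fingerprint(prompt)
        for key, (response, mh, _) in self.store.items():
            if fp.jaccard(mh) >= self.jaccard_threshold:
                self.store.move_to_end(key)
                return response
        # Tier 3: semantic match via sentence-embedding cosine similarity.
        emb = self.encoder.encode(prompt, convert_to_tensor=True)
        for key, (response, _, cached_emb) in self.store.items():
            if util.cos_sim(emb, cached_emb).item() >= self.cosine_threshold:
                self.store.move_to_end(key)
                return response
        return None  # miss: caller queries the LLM, then calls put()

    def put(self, prompt, response):
        if len(self.store) >= self.capacity:
            self.store.popitem(last=False)  # evict least recently used
        self.store[prompt] = (response,
                              self._fingerprint(prompt),
                              self.encoder.encode(prompt, convert_to_tensor=True))
```

The linear scans in tiers 2 and 3 keep the tier ordering easy to follow; a production system would more likely index fingerprints with MinHash LSH and embeddings with a vector index so that each tier stays sublinear in cache size.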
Source journal: Future Generation Computer Systems (The International Journal of eScience)
CiteScore: 19.90
Self-citation rate: 2.70%
Articles published: 376
Review time: 10.6 months
Journal description: Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications. Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration. Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.