{"title":"MinCache:一种基于分层嵌入匹配和LLM的高效聊天机器人混合缓存系统","authors":"Keihan Haqiq , Majid Vafaei Jahan , Saeede Anbaee Farimani , Seyed Mahmood Fattahi Masoom","doi":"10.1016/j.future.2025.107822","DOIUrl":null,"url":null,"abstract":"<div><div>Large Language Models (LLMs) have emerged as powerful tools for various natural language processing tasks such as multi-agent chatbots, but their computational complexity and resource requirements pose significant challenges for real-time chatbot applications. Caching strategies can alleviate these challenges by reducing redundant computations and improving response times. In this paper, we propose MinCache, a novel hybrid caching system tailored for LLM applications. Our system employs a hierarchical cache strategy for string retrieval, performing exact match lookups first, followed by resemblance matching, and finally resorting to semantic matching to deliver the most relevant information. MinCache combines the strengths of Least Recently Used (LRU) cache and string fingerprints caching techniques, leveraging MinHash algorithm for fast the <em>resemblance</em> matching. Additionally, Mincache leverage a sentence-transformer for estimating <em>semantics</em> of input prompts. By integrating these approaches, MinCache delivers high cache hit rates, faster response delivery, and improved scalability for LLM applications across diverse domains. Our experiments demonstrate a significant acceleration of LLM applications by up to <span>4.5X</span> against GPTCache as well as improvements in accurate cache hit rate. We also discuss the scalability of our proposed approach across medical domain chat services.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"170 ","pages":"Article 107822"},"PeriodicalIF":6.2000,"publicationDate":"2025-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"MinCache: A hybrid cache system for efficient chatbots with hierarchical embedding matching and LLM\",\"authors\":\"Keihan Haqiq , Majid Vafaei Jahan , Saeede Anbaee Farimani , Seyed Mahmood Fattahi Masoom\",\"doi\":\"10.1016/j.future.2025.107822\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Large Language Models (LLMs) have emerged as powerful tools for various natural language processing tasks such as multi-agent chatbots, but their computational complexity and resource requirements pose significant challenges for real-time chatbot applications. Caching strategies can alleviate these challenges by reducing redundant computations and improving response times. In this paper, we propose MinCache, a novel hybrid caching system tailored for LLM applications. Our system employs a hierarchical cache strategy for string retrieval, performing exact match lookups first, followed by resemblance matching, and finally resorting to semantic matching to deliver the most relevant information. MinCache combines the strengths of Least Recently Used (LRU) cache and string fingerprints caching techniques, leveraging MinHash algorithm for fast the <em>resemblance</em> matching. Additionally, Mincache leverage a sentence-transformer for estimating <em>semantics</em> of input prompts. By integrating these approaches, MinCache delivers high cache hit rates, faster response delivery, and improved scalability for LLM applications across diverse domains. 
Our experiments demonstrate a significant acceleration of LLM applications by up to <span>4.5X</span> against GPTCache as well as improvements in accurate cache hit rate. We also discuss the scalability of our proposed approach across medical domain chat services.</div></div>\",\"PeriodicalId\":55132,\"journal\":{\"name\":\"Future Generation Computer Systems-The International Journal of Escience\",\"volume\":\"170 \",\"pages\":\"Article 107822\"},\"PeriodicalIF\":6.2000,\"publicationDate\":\"2025-03-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Future Generation Computer Systems-The International Journal of Escience\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0167739X25001177\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, THEORY & METHODS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Future Generation Computer Systems-The International Journal of Escience","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167739X25001177","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
MinCache: A hybrid cache system for efficient chatbots with hierarchical embedding matching and LLM
Large Language Models (LLMs) have emerged as powerful tools for various natural language processing tasks such as multi-agent chatbots, but their computational complexity and resource requirements pose significant challenges for real-time chatbot applications. Caching strategies can alleviate these challenges by reducing redundant computation and improving response times. In this paper, we propose MinCache, a novel hybrid caching system tailored for LLM applications. Our system employs a hierarchical cache strategy for string retrieval, performing exact-match lookups first, followed by resemblance matching, and finally resorting to semantic matching to deliver the most relevant information. MinCache combines the strengths of Least Recently Used (LRU) caching and string-fingerprint caching techniques, leveraging the MinHash algorithm for fast resemblance matching. Additionally, MinCache uses a sentence-transformer to estimate the semantics of input prompts. By integrating these approaches, MinCache delivers high cache hit rates, faster response delivery, and improved scalability for LLM applications across diverse domains. Our experiments demonstrate a significant acceleration of LLM applications, up to 4.5X over GPTCache, as well as improvements in accurate cache-hit rate. We also discuss the scalability of our proposed approach across medical-domain chat services.
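To make the three-tier lookup concrete, the following is a minimal illustrative sketch of such a hierarchical prompt cache in Python. The class name, thresholds, tokenization, and library choices (datasketch for MinHash fingerprints, sentence-transformers for semantic embeddings) are assumptions for illustration; the abstract does not specify the authors' actual implementation.

```python
# Illustrative sketch only: a three-tier prompt cache in the spirit of
# MinCache (exact match -> MinHash resemblance -> semantic similarity).
# Libraries, thresholds, and tokenization are assumed, not taken from the paper.
from collections import OrderedDict

from datasketch import MinHash, MinHashLSH
from sentence_transformers import SentenceTransformer, util

NUM_PERM = 128  # number of MinHash permutations (assumed)


class HierarchicalPromptCache:
    def __init__(self, capacity=1024, jaccard_threshold=0.8, cosine_threshold=0.9):
        self.capacity = capacity
        self.exact = OrderedDict()  # tier 1: LRU store, prompt -> response
        self.lsh = MinHashLSH(threshold=jaccard_threshold, num_perm=NUM_PERM)
        self.encoder = SentenceTransformer("all-MiniLM-L6-v2")
        self.embeddings = {}        # prompt -> embedding tensor
        self.cosine_threshold = cosine_threshold

    @staticmethod
    def _fingerprint(text):
        """Build a MinHash fingerprint from lowercased whitespace tokens."""
        m = MinHash(num_perm=NUM_PERM)
        for token in text.lower().split():
            m.update(token.encode("utf-8"))
        return m

    def get(self, prompt):
        # Tier 1: exact match against the LRU store.
        if prompt in self.exact:
            self.exact.move_to_end(prompt)
            return self.exact[prompt]
        # Tier 2: resemblance match on MinHash string fingerprints.
        candidates = self.lsh.query(self._fingerprint(prompt))
        if candidates:
            return self.exact.get(candidates[0])
        # Tier 3: semantic match on sentence embeddings.
        query_emb = self.encoder.encode(prompt, convert_to_tensor=True)
        for cached_prompt, emb in self.embeddings.items():
            if util.cos_sim(query_emb, emb).item() >= self.cosine_threshold:
                return self.exact.get(cached_prompt)
        return None  # miss: caller queries the LLM, then calls put()

    def put(self, prompt, response):
        if prompt in self.exact:  # refresh an existing entry
            self.exact.move_to_end(prompt)
            self.exact[prompt] = response
            return
        if len(self.exact) >= self.capacity:
            evicted, _ = self.exact.popitem(last=False)  # evict LRU entry
            self.lsh.remove(evicted)
            self.embeddings.pop(evicted, None)
        self.exact[prompt] = response
        self.lsh.insert(prompt, self._fingerprint(prompt))
        self.embeddings[prompt] = self.encoder.encode(prompt, convert_to_tensor=True)
```

A production system would likely also refresh the matched entry's LRU position on tier-2 and tier-3 hits and index embeddings in a vector store rather than scanning them linearly; the sketch omits that bookkeeping for brevity.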
Journal introduction:
Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications.
Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration.
Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.