Cosmos: A CXL-Based Full In-Memory System for Approximate Nearest Neighbor Search

IF 1.4 | Zone 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Computer Architecture Letters Pub Date : 2025-03-14 DOI:10.1109/LCA.2025.3570235

Seoyoung Ko;Hyunjeong Shim;Wanju Doh;Sungmin Yun;Jinin So;Yongsuk Kwon;Sang-Soo Park;Si-Dong Roh;Minyong Yoon;Taeksang Song;Jung Ho Ahn

IEEE Computer Architecture Letters, vol. 24, no. 1, pp. 173-176. Full text: https://ieeexplore.ieee.org/document/11004422/

Citations: 0

Abstract


Retrieval-Augmented Generation (RAG) is crucial for improving the quality of large language models by injecting proper contexts extracted from external sources. RAG requires high-throughput, low-latency Approximate Nearest Neighbor Search (ANNS) over billion-scale vector databases. Conventional DRAM/SSD solutions face capacity/latency limits, whereas specialized hardware or RDMA clusters lack flexibility or incur network overhead. We present Cosmos, integrating general-purpose cores within CXL memory devices for full ANNS offload and introducing rank-level parallel distance computation to maximize memory bandwidth. We also propose an adjacency-aware data placement that balances search loads across CXL devices based on inter-cluster proximity. Evaluations on SIFT1B and DEEP1B traces show that Cosmos achieves up to 6.72× higher throughput than the baseline CXL system and 2.35× over a state-of-the-art CXL-based solution, demonstrating scalability for RAG pipelines.
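The abstract does not spell out how the adjacency-aware placement works, so the following is only a minimal sketch of one way such a policy could look, not the authors' algorithm. It assumes a cluster-based index in which each cluster has a centroid and a size, and it uses a hypothetical greedy heuristic: a cluster goes to the device that currently holds the fewest of its nearest neighboring clusters, with ties broken by device load. The function name `adjacency_aware_placement` and the parameter `num_neighbors` are illustrative assumptions.

```python
import numpy as np


def adjacency_aware_placement(centroids, cluster_sizes, num_devices, num_neighbors=4):
    """Greedy sketch (an assumption, not the paper's method).

    Clusters whose centroids are close tend to be probed by the same query,
    so spreading them over different devices spreads the search load.
    Returns an array mapping cluster index -> device index.
    """
    centroids = np.asarray(centroids, dtype=np.float64)
    sizes = np.asarray(cluster_sizes, dtype=np.float64)
    n = len(centroids)

    # Pairwise squared Euclidean distances between centroids.
    diff = centroids[:, None, :] - centroids[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diff, diff)
    np.fill_diagonal(dist, np.inf)  # a cluster is not its own neighbor

    k = min(num_neighbors, n - 1)
    neighbors = np.argsort(dist, axis=1)[:, :k]

    assignment = np.full(n, -1, dtype=int)
    load = np.zeros(num_devices)

    # Place larger clusters first; a stable sort keeps ties in index order.
    for c in np.argsort(-sizes, kind="stable"):
        # Count already-placed neighbors on each device.
        conflicts = np.zeros(num_devices)
        for nb in neighbors[c]:
            if assignment[nb] >= 0:
                conflicts[assignment[nb]] += 1
        # Prefer the device with the fewest nearby clusters, then the lowest load.
        best = min(range(num_devices), key=lambda d: (conflicts[d], load[d]))
        assignment[c] = best
        load[best] += sizes[c]
    return assignment


if __name__ == "__main__":
    # Two tight pairs of clusters; each pair should be split across devices.
    demo_centroids = [[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]]
    print(adjacency_aware_placement(demo_centroids, [100, 100, 100, 100],
                                    num_devices=2, num_neighbors=1))
```

The greedy rule trades optimality for simplicity. A real system would likely also account for per-device capacity and observed query distributions, neither of which is modeled here.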


Source journal

IEEE Computer Architecture Letters (COMPUTER SCIENCE, HARDWARE & ARCHITECTURE)

CiteScore

4.60

Self-citation rate

4.30%

Articles published

About the journal: IEEE Computer Architecture Letters is a rigorously peer-reviewed forum for publishing early, high-impact results in the areas of uni- and multiprocessor computer systems, computer architecture, microarchitecture, workload characterization, performance evaluation and simulation techniques, and power-aware computing. Submissions are welcomed on any topic in computer architecture, especially but not limited to: microprocessor and multiprocessor systems, microarchitecture and ILP processors, workload characterization, performance evaluation and simulation techniques, compiler-hardware and operating system-hardware interactions, interconnect architectures, memory and cache systems, power and thermal issues at the architecture level, I/O architectures and techniques, independent validation of previously published results, analysis of unsuccessful techniques, domain-specific processor architectures (e.g., embedded, graphics, network), real-time and high-availability architectures, reconfigurable systems.