CABANA: Cluster-Aware Query Batching for Accelerating Billion-Scale ANNS With Intel AMX

IF 1.4 · Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE
Minho Kim;Houxiang Ji;Jaeyoung Kang;Hwanjun Lee;Daehoon Kim;Nam Sung Kim
{"title":"用Intel AMX加速十亿规模ANNS的集群感知查询批处理","authors":"Minho Kim;Houxiang Ji;Jaeyoung Kang;Hwanjun Lee;Daehoon Kim;Nam Sung Kim","doi":"10.1109/LCA.2025.3596970","DOIUrl":null,"url":null,"abstract":"Retrieval-augmented generation (RAG) systems increasingly rely on Approximate Nearest Neighbor Search (ANNS) to efficiently retrieve relevant context from billion-scale vector databases. While IVF-based ANNS frameworks scale well overall, the fine search stage remains a bottleneck due to its compute-intensive GEMV operations, particularly under large query volumes. To address this, we propose <monospace>CABANA</monospace>, a <u>c</u>luster-<u>a</u>ware query <u>b</u>atching for <u>AN</u>NS <u>a</u>cceleration mechanism using Intel Advanced Matrix Extensions (AMX) that reformulates these GEMV computations into high-throughput GEMM operations. By aggregating queries targeting the same clusters, <monospace>CABANA</monospace> enables batched computation during fine search, significantly improving compute intensity and memory access regularity. Evaluations on billion-scale datasets show that <monospace>CABANA</monospace> outperforms traditional SIMD-based implementations, achieving up to <inline-formula><tex-math>$32.6\\times$</tex-math></inline-formula> higher query throughput with minimal overhead, while maintaining high recall rates.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"289-292"},"PeriodicalIF":1.4000,"publicationDate":"2025-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11120372","citationCount":"0","resultStr":"{\"title\":\"CABANA : Cluster-Aware Query Batching for Accelerating Billion-Scale ANNS With Intel AMX\",\"authors\":\"Minho Kim;Houxiang Ji;Jaeyoung Kang;Hwanjun Lee;Daehoon Kim;Nam Sung Kim\",\"doi\":\"10.1109/LCA.2025.3596970\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Retrieval-augmented generation (RAG) systems increasingly rely on Approximate Nearest Neighbor Search (ANNS) to efficiently retrieve relevant context from billion-scale vector databases. While IVF-based ANNS frameworks scale well overall, the fine search stage remains a bottleneck due to its compute-intensive GEMV operations, particularly under large query volumes. To address this, we propose <monospace>CABANA</monospace>, a <u>c</u>luster-<u>a</u>ware query <u>b</u>atching for <u>AN</u>NS <u>a</u>cceleration mechanism using Intel Advanced Matrix Extensions (AMX) that reformulates these GEMV computations into high-throughput GEMM operations. By aggregating queries targeting the same clusters, <monospace>CABANA</monospace> enables batched computation during fine search, significantly improving compute intensity and memory access regularity. 
Evaluations on billion-scale datasets show that <monospace>CABANA</monospace> outperforms traditional SIMD-based implementations, achieving up to <inline-formula><tex-math>$32.6\\\\times$</tex-math></inline-formula> higher query throughput with minimal overhead, while maintaining high recall rates.\",\"PeriodicalId\":51248,\"journal\":{\"name\":\"IEEE Computer Architecture Letters\",\"volume\":\"24 2\",\"pages\":\"289-292\"},\"PeriodicalIF\":1.4000,\"publicationDate\":\"2025-08-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11120372\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Computer Architecture Letters\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11120372/\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Computer Architecture Letters","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11120372/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Citations: 0

Abstract

Retrieval-augmented generation (RAG) systems increasingly rely on Approximate Nearest Neighbor Search (ANNS) to efficiently retrieve relevant context from billion-scale vector databases. While IVF-based ANNS frameworks scale well overall, the fine search stage remains a bottleneck due to its compute-intensive GEMV operations, particularly under large query volumes. To address this, we propose CABANA, a cluster-aware query batching for ANNS acceleration mechanism using Intel Advanced Matrix Extensions (AMX) that reformulates these GEMV computations into high-throughput GEMM operations. By aggregating queries targeting the same clusters, CABANA enables batched computation during fine search, significantly improving compute intensity and memory access regularity. Evaluations on billion-scale datasets show that CABANA outperforms traditional SIMD-based implementations, achieving up to 32.6× higher query throughput with minimal overhead, while maintaining high recall rates.
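
The mechanism described in the abstract lends itself to a compact illustration. In IVF-style fine search, each query is compared against the vectors of the few clusters it probes; when many queries probe the same cluster, those per-query GEMV distance computations over identical cluster data can be fused into a single GEMM, which is the dense, tile-shaped kernel that Intel AMX (or any optimized BLAS) executes at high throughput. The sketch below shows this batching idea in NumPy; the function name, data layout, and top-k merging are hypothetical simplifications for illustration only and do not reproduce the authors' implementation.

```python
# Minimal sketch of cluster-aware query batching during IVF fine search.
# All names and data layouts here are hypothetical; the point is only that
# grouping queries by probed cluster turns many GEMV calls into one GEMM
# per cluster, the kernel shape AMX-style tile units (or any BLAS) favor.
import numpy as np
from collections import defaultdict

def fine_search_batched(queries, probe_lists, cluster_vectors, k):
    """queries: (Q, d) float array; probe_lists[q]: cluster ids probed by
    query q; cluster_vectors[c]: (n_c, d) array of database vectors stored
    in cluster c. Returns per-query top-k (distance, (cluster, row)) pairs."""
    # 1. Regroup the work by cluster: collect all queries probing cluster c.
    cluster_to_queries = defaultdict(list)
    for q, clusters in enumerate(probe_lists):
        for c in clusters:
            cluster_to_queries[c].append(q)

    candidates = defaultdict(list)  # query id -> [(distance, (cluster, row)), ...]
    for c, q_ids in cluster_to_queries.items():
        Q = queries[q_ids]        # (B, d) batch of queries targeting cluster c
        X = cluster_vectors[c]    # (n_c, d) vectors stored in cluster c
        # 2. One GEMM replaces B separate GEMVs over the same cluster data,
        #    so the cluster's vectors are streamed from memory once per batch.
        dots = Q @ X.T            # (B, n_c)
        dists = (np.sum(Q * Q, axis=1)[:, None]
                 - 2.0 * dots
                 + np.sum(X * X, axis=1)[None, :])
        for bi, q in enumerate(q_ids):
            top = np.argpartition(dists[bi], min(k, dists.shape[1] - 1))[:k]
            candidates[q].extend((float(dists[bi, i]), (c, int(i))) for i in top)

    # 3. Merge candidates across all probed clusters and keep the global top-k.
    return {q: sorted(cands)[:k] for q, cands in candidates.items()}
```

With this regrouping, each cluster's vectors are read once per batch rather than once per query, which is where the improved compute intensity and memory-access regularity claimed in the abstract come from.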
Source journal

IEEE Computer Architecture Letters
CiteScore: 4.60
Self-citation rate: 4.30%
Articles per year: 29
Journal description: IEEE Computer Architecture Letters is a rigorously peer-reviewed forum for publishing early, high-impact results in the areas of uni- and multiprocessor computer systems, computer architecture, microarchitecture, workload characterization, performance evaluation and simulation techniques, and power-aware computing. Submissions are welcomed on any topic in computer architecture, especially but not limited to: microprocessor and multiprocessor systems, microarchitecture and ILP processors, workload characterization, performance evaluation and simulation techniques, compiler-hardware and operating system-hardware interactions, interconnect architectures, memory and cache systems, power and thermal issues at the architecture level, I/O architectures and techniques, independent validation of previously published results, analysis of unsuccessful techniques, domain-specific processor architectures (e.g., embedded, graphics, network, etc.), real-time and high-availability architectures, reconfigurable systems.