CABANA : Cluster-Aware Query Batching for Accelerating Billion-Scale ANNS With Intel AMX

IF 1.4 3区计算机科学 Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Computer Architecture Letters Pub Date : 2025-08-08 DOI:10.1109/LCA.2025.3596970

Minho Kim;Houxiang Ji;Jaeyoung Kang;Hwanjun Lee;Daehoon Kim;Nam Sung Kim

{"title":"CABANA : Cluster-Aware Query Batching for Accelerating Billion-Scale ANNS With Intel AMX","authors":"Minho Kim;Houxiang Ji;Jaeyoung Kang;Hwanjun Lee;Daehoon Kim;Nam Sung Kim","doi":"10.1109/LCA.2025.3596970","DOIUrl":null,"url":null,"abstract":"Retrieval-augmented generation (RAG) systems increasingly rely on Approximate Nearest Neighbor Search (ANNS) to efficiently retrieve relevant context from billion-scale vector databases. While IVF-based ANNS frameworks scale well overall, the fine search stage remains a bottleneck due to its compute-intensive GEMV operations, particularly under large query volumes. To address this, we propose <monospace>CABANA</monospace>, a cluster-aware query batching for ANNS acceleration mechanism using Intel Advanced Matrix Extensions (AMX) that reformulates these GEMV computations into high-throughput GEMM operations. By aggregating queries targeting the same clusters, <monospace>CABANA</monospace> enables batched computation during fine search, significantly improving compute intensity and memory access regularity. Evaluations on billion-scale datasets show that <monospace>CABANA</monospace> outperforms traditional SIMD-based implementations, achieving up to <inline-formula><tex-math>$32.6\\times$</tex-math></inline-formula> higher query throughput with minimal overhead, while maintaining high recall rates.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"289-292"},"PeriodicalIF":1.4000,"publicationDate":"2025-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11120372","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Computer Architecture Letters","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11120372/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}

引用次数: 0

Abstract

Retrieval-augmented generation (RAG) systems increasingly rely on Approximate Nearest Neighbor Search (ANNS) to efficiently retrieve relevant context from billion-scale vector databases. While IVF-based ANNS frameworks scale well overall, the fine search stage remains a bottleneck due to its compute-intensive GEMV operations, particularly under large query volumes. To address this, we propose CABANA, a cluster-aware query batching for ANNS acceleration mechanism using Intel Advanced Matrix Extensions (AMX) that reformulates these GEMV computations into high-throughput GEMM operations. By aggregating queries targeting the same clusters, CABANA enables batched computation during fine search, significantly improving compute intensity and memory access regularity. Evaluations on billion-scale datasets show that CABANA outperforms traditional SIMD-based implementations, achieving up to

$32.6\times$

higher query throughput with minimal overhead, while maintaining high recall rates.

查看原文本刊更多论文

用Intel AMX加速十亿规模ANNS的集群感知查询批处理

检索增强生成（RAG）系统越来越依赖于近似最近邻搜索（ANNS）来有效地从十亿规模的向量数据库中检索相关上下文。虽然基于ivf的ANNS框架总体上扩展良好，但由于其计算密集型的GEMV操作，特别是在大查询量下，精细搜索阶段仍然是一个瓶颈。为了解决这个问题，我们提出了CABANA，这是一种使用英特尔高级矩阵扩展（AMX）的ANNS加速机制的集群感知查询批处理，它将这些GEMV计算重新制定为高吞吐量的GEMM操作。通过聚合针对相同集群的查询，CABANA可以在精细搜索期间进行批量计算，显著提高计算强度和内存访问规律。对十亿规模数据集的评估表明，CABANA优于传统的基于simd的实现，以最小的开销实现高达32.6倍的查询吞吐量，同时保持高召回率。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

IEEE Computer Architecture Letters COMPUTER SCIENCE, HARDWARE & ARCHITECTURE-

CiteScore

4.60

自引率

4.30%

发文量

期刊介绍： IEEE Computer Architecture Letters is a rigorously peer-reviewed forum for publishing early, high-impact results in the areas of uni- and multiprocessor computer systems, computer architecture, microarchitecture, workload characterization, performance evaluation and simulation techniques, and power-aware computing. Submissions are welcomed on any topic in computer architecture, especially but not limited to: microprocessor and multiprocessor systems, microarchitecture and ILP processors, workload characterization, performance evaluation and simulation techniques, compiler-hardware and operating system-hardware interactions, interconnect architectures, memory and cache systems, power and thermal issues at the architecture level, I/O architectures and techniques, independent validation of previously published results, analysis of unsuccessful techniques, domain-specific processor architectures (e.g., embedded, graphics, network, etc.), real-time and high-availability architectures, reconfigurable systems.