{"title":"OASIS: Outlier-Aware KV Cache Clustering for Scaling LLM Inference in CXL Memory Systems","authors":"Minseok Seo;Jungi Hyun;Seongho Jeong;Xuan Truong Nguyen;Hyuk-Jae Lee;Hyokeun Lee","doi":"10.1109/LCA.2025.3567844","DOIUrl":"https://doi.org/10.1109/LCA.2025.3567844","url":null,"abstract":"The key-value (KV) cache in large language models (LLMs) now necessitates a substantial amount of memory capacity as its size grows in proportion to the context’s size. Recently, Compute Express Link (CXL) memory has become a promising way to secure memory capacity. However, CXL memory in a GPU-based LLM inference platform entails performance and scalability challenges due to the limited bandwidth of CXL memory. This paper proposes OASIS, an outlier-aware KV cache clustering method for scaling LLM inference in CXL memory systems. Our method is based on the observation that clustering is effective in trading off between performance and accuracy compared to previous quantization- or selection-based approaches if clustering is aware of outliers. Our evaluation shows OASIS yields 3.6× speedup compared to the case without clustering while preserving accuracy with just 5% of full KV cache.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"165-168"},"PeriodicalIF":1.4,"publicationDate":"2025-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144139974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data-Pattern-Driven LUT for Efficient In-Cache Computing in CNNs Acceleration","authors":"Zhengpan Fei;Mingchuan Lyu;Satoshi Kawakami;Koji Inoue","doi":"10.1109/LCA.2025.3548080","DOIUrl":"https://doi.org/10.1109/LCA.2025.3548080","url":null,"abstract":"The lookup table (LUT)-based Processing-in-Memory (PIM) solutions perform computations by looking up precomputed results stored in LUTs, providing exceptional efficiency for complex operations such as multiplication, making them highly suitable for energy- and latency-efficient Convolutional Neural Network (CNN) inference tasks. However, including all possible results in the LUT naively demands exponential hardware resources, significantly limiting parallelism and increasing hardware area, latency, and power overhead. While decomposition and compression techniques can reduce the LUT size, they also introduce considerable memory access overhead and additional operations. To address these challenges, we conduct an extensive analysis to identify which data portions significantly impact accuracy in CNNs. Based on the insight that key data is concentrated in a small range, we propose a data-pattern-driven (DPD) optimization strategy, which approximates less critical data to drastically reduce LUT size while preserving computational efficiency with acceptable accuracy loss.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"81-84"},"PeriodicalIF":1.4,"publicationDate":"2025-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143706788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating Vector Permutation Instruction Execution via Controllable Bitonic Network","authors":"Shabirahmed Badashasab Jigalur;Daniel Jiménez Mazure;Teresa Cervero Garcia;Yen-Cheng Kuan","doi":"10.1109/LCA.2025.3548527","DOIUrl":"https://doi.org/10.1109/LCA.2025.3548527","url":null,"abstract":"High-performance computing applications rely heavily on vector instructions to accelerate data processing. In this letter, we propose a controllable bitonic network (CBN) and use it as a lane interconnect to efficiently rearrange data across vector lanes of a vector processing unit to accelerate the execution of vector permutation instructions (VPIs). Our work focuses on the RISC-V vector instruction set because of its configurable vector length support. Through simulations with vector-permutation-intensive applications of a RISC-V vector benchmark suite (RiVEC), the proposed approach with an eight-lane 64-bit CBN demonstrates an average speedup of ≥6× regarding the VPI execution time over a conventional ring-network-based approach. In addition, to verify our approach on hardware, we implemented a processor system with an eight-lane 16-bit CBN on an AMD A7-100T FPGA operating at 20 MHz, demonstrating single-cycle execution of the RISC-V vr.gather and vr.scatter instructions.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"133-136"},"PeriodicalIF":1.4,"publicationDate":"2025-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913300","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DPWatch: A Framework for Hardware-Based Differential Privacy Guarantees","authors":"Pawan Kumar Sanjaya;Christina Giannoula;Ian Colbert;Ihab Amer;Mehdi Saeedi;Gabor Sines;Nandita Vijaykumar","doi":"10.1109/LCA.2025.3547262","DOIUrl":"https://doi.org/10.1109/LCA.2025.3547262","url":null,"abstract":"Differential privacy (DP) and federated learning (FL) have emerged as important privacy-preserving approaches when using sensitive data to train machine learning models. FL ensures that raw sensitive data does not leave the users’ devices by training the model in a distributed manner. DP ensures that the model does not leak any information about an individual by clipping and adding noise to the gradients. However, real-life deployments of such algorithms assume that the third-party application implementing DP-based FL is trusted, and is thus given access to sensitive data on the data owner’s device/server. In this work, we propose DPWatch, a hardware-based framework for ML accelerators that enforces guarantees that a third party application cannot leak sensitive user data used for training and ensures that the gradients are appropriately noised before leaving the device. We evaluate DPWatch on two accelerators and demonstrate small area and performance overheads.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"89-92"},"PeriodicalIF":1.4,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143761420","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Amethyst: Reducing Data Center Emissions With Dynamic Autotuning and VM Management","authors":"Mattia Tibaldi;Christian Pilato","doi":"10.1109/LCA.2025.3566553","DOIUrl":"https://doi.org/10.1109/LCA.2025.3566553","url":null,"abstract":"To reduce emerging carbon emissions in cloud computing, we propose Amethyst, a new VM placement and migration strategy capable of adapting consumption to the currently available green energy. Amethyst tackles the problem on three fronts: it adjusts the consumption to energy production, optimizes execution on FPGA accelerators, and balances execution among servers. We evaluate the strategy with real workloads. Our simulations on CloudSim Plus show that Amethyst effectively reduces the carbon emissions of cloud computing and increases energy efficiency compared to the state of the art.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"153-156"},"PeriodicalIF":1.4,"publicationDate":"2025-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144139909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fold-PIM: A Cost-Efficient LPDDR5-Based PIM for On-Device SLMs","authors":"Kyoungho Jeun;Hyeonu Kim;Eojin Lee","doi":"10.1109/LCA.2025.3566692","DOIUrl":"https://doi.org/10.1109/LCA.2025.3566692","url":null,"abstract":"The increasing demand for on-device AI applications has shifted focus to Small Language Models (SLMs) optimized for mobile environments. However, the limited memory bandwidth of LPDDR5-based systems presents significant challenges for efficiently executing memory-bound matrix-vector multiplication operations, a core component of SLM inference. In this paper, we propose Fold-PIM, an LPDDR5-based Processing-in-Memory (PIM) architecture designed to address these challenges. Fold-PIM features a shared PU architecture that leverages subarray-level parallelism and employs key techniques with in-tile transposition, adaptive tiling, and a tailored protocol to reduce vector replacement latency. Our evaluation results demonstrate that Fold-PIM achieves up to 3.9× speedup of token generation time in SLM inference compared to the baseline system without PIM.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"185-188"},"PeriodicalIF":1.4,"publicationDate":"2025-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144206124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Characterization of Generative Recommendation Models: Study of Hierarchical Sequential Transduction Unit","authors":"Taehun Kim;Yunjae Lee;Juntaek Lim;Minsoo Rhu","doi":"10.1109/LCA.2025.3546811","DOIUrl":"https://doi.org/10.1109/LCA.2025.3546811","url":null,"abstract":"Recommendation systems are crucial for personalizing user experiences on online platforms. While Deep Learning Recommendation Models (DLRMs) have been the state-of-the-art for nearly a decade, their scalability is limited, as model quality scales poorly with compute. Recently, there have been research efforts applying the Transformer architecture to recommendation systems, and Hierarchical Sequential Transduction Unit (HSTU), an encoder architecture, has been proposed to address scalability challenges. Although HSTU-based generative recommenders show significant potential, they have received little attention from computer architects. In this paper, we analyze the inference process of HSTU-based generative recommenders and perform an in-depth characterization of the model. Our findings indicate the attention mechanism is a major performance bottleneck. We further discuss promising research directions and optimization strategies that can potentially enhance the efficiency of HSTU models.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"85-88"},"PeriodicalIF":1.4,"publicationDate":"2025-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143706790","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Thor: A Non-Speculative Value Dependent Timing Side Channel Attack Exploiting Intel AMX","authors":"Farshad Dizani;Azam Ghanbari;Joshua Kalyanapu;Darsh Asher;Samira Mirbagher Ajorpaz","doi":"10.1109/LCA.2025.3544989","DOIUrl":"https://doi.org/10.1109/LCA.2025.3544989","url":null,"abstract":"The rise of on-chip accelerators signifies a major shift in computing, driven by the growing demands of artificial intelligence (AI) and specialized applications. These accelerators have gained popularity due to their ability to substantially boost performance, cut energy usage, lower total cost of ownership (TCO), and promote sustainability. Intel's Advanced Matrix Extensions (AMX) is one such on-chip accelerator, specifically designed for handling tasks involving large matrix multiplications commonly used in machine learning (ML) models, image processing, and other computational-heavy operations. In this paper, we introduce a novel value-dependent timing side-channel vulnerability in Intel AMX. By exploiting this weakness, we demonstrate a software-based, value-dependent timing side-channel attack capable of inferring the sparsity of neural network weights without requiring any knowledge of the confidence score, privileged access or physical proximity. Our attack method can fully recover the sparsity of weights assigned to 64 input elements within 50 minutes, which is 631% faster than the maximum leakage rate achieved in the Hertzbleed attack.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"69-72"},"PeriodicalIF":1.4,"publicationDate":"2025-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143688117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Data Prefetcher-Based 1000-Core RISC-V Processor for Efficient Processing of Graph Neural Networks","authors":"Omer Khan","doi":"10.1109/LCA.2025.3545799","DOIUrl":"https://doi.org/10.1109/LCA.2025.3545799","url":null,"abstract":"Graph-based neural networks have seen tremendous adoption to perform complex predictive analytics on massive real-world graphs. The trend in hardware acceleration has identified significant challenges with harnessing graph locality and workload imbalance due to ultra-sparse and irregular matrix computations at a massively parallel scale. State-of-the-art hardware accelerators utilize massive multithreading and asynchronous execution in GPUs to achieve parallel performance at high power consumption. This paper aims to bridge the power-performance gap using the energy efficiency-centric RISC-V ecosystem. A 1000-core RISC-V processor is proposed to unlock massive parallelism in the graph-based matrix operators and to provide a low-latency data access paradigm in hardware, achieving robust power-performance scaling. Each core implements a single-threaded pipeline with a novel graph-aware data prefetcher at the 1000 cores scale to deliver an average 20× performance per watt advantage over a state-of-the-art NVIDIA GPU.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"73-76"},"PeriodicalIF":1.4,"publicationDate":"2025-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143698183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"QuArch: A Question-Answering Dataset for AI Agents in Computer Architecture","authors":"Shvetank Prakash;Andrew Cheng;Jason Yik;Arya Tschand;Radhika Ghosal;Ikechukwu Uchendu;Jessica Quaye;Jeffrey Ma;Shreyas Grampurohit;Sofia Giannuzzi;Arnav Balyan;Fin Amin;Aadya Pipersenia;Yash Choudhary;Ankita Nayak;Amir Yazdanbakhsh;Vijay Janapa Reddi","doi":"10.1109/LCA.2025.3541961","DOIUrl":"https://doi.org/10.1109/LCA.2025.3541961","url":null,"abstract":"We introduce QuArch, a dataset of 1500 human-validated question-answer pairs designed to evaluate and enhance language models’ understanding of computer architecture. The dataset covers areas including processor design, memory systems, and performance optimization. Our analysis highlights a significant performance gap: the best closed-source model achieves 84% accuracy, while the top small open-source model reaches 72%. We observe notable struggles on QAs regarding memory systems and interconnection networks. Fine-tuning with QuArch improves small model accuracy by up to 8%, establishing a foundation for advancing AI-driven computer architecture research. The dataset and the leaderboard are accessible at https://quarch.ai/.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"105-108"},"PeriodicalIF":1.4,"publicationDate":"2025-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143848843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}