IEEE Computer Architecture Letters: Latest Publications

OASIS: Outlier-Aware KV Cache Clustering for Scaling LLM Inference in CXL Memory Systems
IF 1.4 · CAS Q3 · Computer Science
IEEE Computer Architecture Letters Pub Date : 2025-03-07 DOI: 10.1109/LCA.2025.3567844
Minseok Seo;Jungi Hyun;Seongho Jeong;Xuan Truong Nguyen;Hyuk-Jae Lee;Hyokeun Lee
{"title":"OASIS: Outlier-Aware KV Cache Clustering for Scaling LLM Inference in CXL Memory Systems","authors":"Minseok Seo;Jungi Hyun;Seongho Jeong;Xuan Truong Nguyen;Hyuk-Jae Lee;Hyokeun Lee","doi":"10.1109/LCA.2025.3567844","DOIUrl":"https://doi.org/10.1109/LCA.2025.3567844","url":null,"abstract":"The key-value (KV) cache in large language models (LLMs) now necessitates a substantial amount of memory capacity as its size proportionally grows with the context’s size. Recently, Compute-Express Link (CXL) memory becomes a promising method to secure memory capacity. However, CXL memory in a GPU-based LLM inference platform entails performance and scalability challenges due to the limited bandwidth of CXL memory. This paper proposes OASIS, an outlier-aware KV cache clustering for scaling LLM inference in CXL memory systems. Our method is based on the observation that clustering is effective in trading off between performance and accuracy compared to previous quantization- or selection-based approaches if clustering is aware of outliers. Our evaluation shows OASIS yields 3.6× speedup compared to the case without clustering while preserving accuracy with just 5% of full KV cache.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"165-168"},"PeriodicalIF":1.4,"publicationDate":"2025-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144139974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Data-Pattern-Driven LUT for Efficient In-Cache Computing in CNNs Acceleration
IF 1.4 · CAS Q3 · Computer Science
IEEE Computer Architecture Letters Pub Date : 2025-03-05 DOI: 10.1109/LCA.2025.3548080
Zhengpan Fei;Mingchuan Lyu;Satoshi Kawakami;Koji Inoue
{"title":"Data-Pattern-Driven LUT for Efficient In-Cache Computing in CNNs Acceleration","authors":"Zhengpan Fei;Mingchuan Lyu;Satoshi Kawakami;Koji Inoue","doi":"10.1109/LCA.2025.3548080","DOIUrl":"https://doi.org/10.1109/LCA.2025.3548080","url":null,"abstract":"The lookup table (LUT)-based Processing-in-Memory (PIM) solutions perform computations by looking up precomputed results stored in LUTs, providing exceptional efficiency for complex operations such as multiplication, making them highly suitable for energy- and latency-efficient Convolutional Neural Network (CNN) inference tasks. However, including all possible results in the LUT naively demands exponential hardware resources, significantly limiting parallelism and increasing hardware area, latency, and power overhead. While decomposition and compression techniques can reduce the LUT size, they also introduce considerable memory access overhead and additional operations. To address these challenges, we conduct an extensive analysis to identify which data portions significantly impact accuracy in CNNs. Based on the insight that key data is concentrated in a small range, we propose a data-pattern-driven (DPD) optimization strategy, which approximates less critical data to drastically reduce LUT size while preserving computational efficiency with acceptable accuracy loss.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"81-84"},"PeriodicalIF":1.4,"publicationDate":"2025-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143706788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Accelerating Vector Permutation Instruction Execution via Controllable Bitonic Network
IF 1.4 · CAS Q3 · Computer Science
IEEE Computer Architecture Letters Pub Date : 2025-03-05 DOI: 10.1109/LCA.2025.3548527
Shabirahmed Badashasab Jigalur;Daniel Jiménez Mazure;Teresa Cervero Garcia;Yen-Cheng Kuan
{"title":"Accelerating Vector Permutation Instruction Execution via Controllable Bitonic Network","authors":"Shabirahmed Badashasab Jigalur;Daniel Jiménez Mazure;Teresa Cervero Garcia;Yen-Cheng Kuan","doi":"10.1109/LCA.2025.3548527","DOIUrl":"https://doi.org/10.1109/LCA.2025.3548527","url":null,"abstract":"High-performance computing applications rely heavily on vector instructions to accelerate data processing. In this letter, we propose a controllable bitonic network (CBN) and use it as a lane interconnect to efficiently rearrange data across vector lanes of a vector processing unit to accelerate the execution of vector permutation instructions (VPIs). Our work focuses on the RISC-V vector instruction set because of its configurable vector length support. Through simulations with vector-permutation-intensive applications of a RISC-V vector benchmark suite (RiVEC), the proposed approach with an eight-lane 64-bit CBN demonstrates an average speedup of ≥6× regarding the VPI execution time over a conventional ring-network-based approach. In addition, to verify our approach on hardware, we implemented a processor system with an eight-lane 16-bit CBN on an AMD A7-100T FPGA operating at 20 MHz, demonstrating single-cycle execution of the RISC-V <italic>vr.gather</i> and <italic>vr.scatter</i> instructions.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"133-136"},"PeriodicalIF":1.4,"publicationDate":"2025-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913300","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
DPWatch: A Framework for Hardware-Based Differential Privacy Guarantees
IF 1.4 · CAS Q3 · Computer Science
IEEE Computer Architecture Letters Pub Date : 2025-03-04 DOI: 10.1109/LCA.2025.3547262
Pawan Kumar Sanjaya;Christina Giannoula;Ian Colbert;Ihab Amer;Mehdi Saeedi;Gabor Sines;Nandita Vijaykumar
{"title":"DPWatch: A Framework for Hardware-Based Differential Privacy Guarantees","authors":"Pawan Kumar Sanjaya;Christina Giannoula;Ian Colbert;Ihab Amer;Mehdi Saeedi;Gabor Sines;Nandita Vijaykumar","doi":"10.1109/LCA.2025.3547262","DOIUrl":"https://doi.org/10.1109/LCA.2025.3547262","url":null,"abstract":"Differential privacy (DP) and federated learning (FL) have emerged as important privacy-preserving approaches when using sensitive data to train machine learning models. FL ensures that raw sensitive data does not leave the users’ devices by training the model in a distributed manner. DP ensures that the model does not leak any information about an individual by <italic>clipping</i> and adding <italic>noise</i> to the gradients. However, real-life deployments of such algorithms assume that the third-party application implementing DP-based FL is trusted, and is thus given access to sensitive data on the data owner’s device/server. In this work, we propose DPWatch, a hardware-based framework for ML accelerators that enforces guarantees that a third party application cannot leak sensitive user data used for training and ensures that the gradients are appropriately noised before leaving the device. We evaluate DPWatch on two accelerators and demonstrate small area and performance overheads.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"89-92"},"PeriodicalIF":1.4,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143761420","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Amethyst: Reducing Data Center Emissions With Dynamic Autotuning and VM Management
IF 1.4 · CAS Q3 · Computer Science
IEEE Computer Architecture Letters Pub Date : 2025-03-02 DOI: 10.1109/LCA.2025.3566553
Mattia Tibaldi;Christian Pilato
{"title":"Amethyst: Reducing Data Center Emissions With Dynamic Autotuning and VM Management","authors":"Mattia Tibaldi;Christian Pilato","doi":"10.1109/LCA.2025.3566553","DOIUrl":"https://doi.org/10.1109/LCA.2025.3566553","url":null,"abstract":"To reduce emerging carbon emissions in cloud computing, we proposed Amethyst, a new VM placement and migration strategy capable of adapting consumption to the currently available green energy. Amethyst tackles the problem on three fronts: it adjusts the consumption to energy production, optimizes execution on FPGA accelerators, and balances execution among servers. We evaluate the strategy with real workloads. Our simulations on CloudSim Plus show that Amethyst effectively reduces the carbon emissions of cloud computing and, compared to the state-of-the-art, it increases the energy efficiency.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"153-156"},"PeriodicalIF":1.4,"publicationDate":"2025-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144139909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Fold-PIM: A Cost-Efficient LPDDR5-Based PIM for On-Device SLMs
IF 1.4 · CAS Q3 · Computer Science
IEEE Computer Architecture Letters Pub Date : 2025-03-02 DOI: 10.1109/LCA.2025.3566692
Kyoungho Jeun;Hyeonu Kim;Eojin Lee
{"title":"Fold-PIM: A Cost-Efficient LPDDR5-Based PIM for On-Device SLMs","authors":"Kyoungho Jeun;Hyeonu Kim;Eojin Lee","doi":"10.1109/LCA.2025.3566692","DOIUrl":"https://doi.org/10.1109/LCA.2025.3566692","url":null,"abstract":"The increasing demand for on-device AI applications has shifted focus to Small Language Models (SLMs) optimized for mobile environments. However, the limited memory bandwidth of LPDDR5-based systems presents significant challenges for efficiently executing memory-bound matrix-vector multiplication operations, a core component of SLM inference. In this paper, we propose Fold-PIM, an LPDDR5-based Processing-in-Memory (PIM) architecture designed to address these challenges. Fold-PIM features a shared PU architecture that leverages subarray-level parallelism and employs key techniques with in-tile transposition, adaptive tiling, and a tailored protocol to reduce vector replacement latency. Our evaluation results demonstrate that Fold-PIM achieves up to 3.9× speedup of token generation time in SLM inference compared to the baseline system without PIM.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"185-188"},"PeriodicalIF":1.4,"publicationDate":"2025-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144206124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A Characterization of Generative Recommendation Models: Study of Hierarchical Sequential Transduction Unit
IF 1.4 · CAS Q3 · Computer Science
IEEE Computer Architecture Letters Pub Date : 2025-02-28 DOI: 10.1109/LCA.2025.3546811
Taehun Kim;Yunjae Lee;Juntaek Lim;Minsoo Rhu
{"title":"A Characterization of Generative Recommendation Models: Study of Hierarchical Sequential Transduction Unit","authors":"Taehun Kim;Yunjae Lee;Juntaek Lim;Minsoo Rhu","doi":"10.1109/LCA.2025.3546811","DOIUrl":"https://doi.org/10.1109/LCA.2025.3546811","url":null,"abstract":"Recommendation systems are crucial for personalizing user experiences on online platforms. While Deep Learning Recommendation Models (DLRMs) have been the state-of-the-art for nearly a decade, their scalability is limited, as model quality scales poorly with compute. Recently, there have been research efforts applying Transformer architecture to recommendation systems, and Hierarchical Sequential Transaction Unit (HSTU), an encoder architecture, has been proposed to address scalability challenges. Although HSTU-based generative recommenders show significant potential, they have received little attention from computer architects. In this paper, we analyze the inference process of HSTU-based generative recommenders and perform an in-depth characterization of the model. Our findings indicate the attention mechanism is a major performance bottleneck. We further discuss promising research directions and optimization strategies that can potentially enhance the efficiency of HSTU models.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"85-88"},"PeriodicalIF":1.4,"publicationDate":"2025-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143706790","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Thor: A Non-Speculative Value Dependent Timing Side Channel Attack Exploiting Intel AMX
IF 1.4 · CAS Q3 · Computer Science
IEEE Computer Architecture Letters Pub Date : 2025-02-27 DOI: 10.1109/LCA.2025.3544989
Farshad Dizani;Azam Ghanbari;Joshua Kalyanapu;Darsh Asher;Samira Mirbagher Ajorpaz
{"title":"Thor: A Non-Speculative Value Dependent Timing Side Channel Attack Exploiting Intel AMX","authors":"Farshad Dizani;Azam Ghanbari;Joshua Kalyanapu;Darsh Asher;Samira Mirbagher Ajorpaz","doi":"10.1109/LCA.2025.3544989","DOIUrl":"https://doi.org/10.1109/LCA.2025.3544989","url":null,"abstract":"The rise of on-chip accelerators signifies a major shift in computing, driven by the growing demands of artificial intelligence (AI) and specialized applications. These accelerators have gained popularity due to their ability to substantially boost performance, cut energy usage, lower total cost of ownership (TCO), and promote sustainability. Intel's Advanced Matrix Extensions (AMX) is one such on-chip accelerator, specifically designed for handling tasks involving large matrix multiplications commonly used in machine learning (ML) models, image processing, and other computational-heavy operations. In this paper, we introduce a novel value-dependent timing side-channel vulnerability in Intel AMX. By exploiting this weakness, we demonstrate a software-based, value-dependent timing side-channel attack capable of inferring the sparsity of neural network weights without requiring any knowledge of the confidence score, privileged access or physical proximity. Our attack method can fully recover the sparsity of weights assigned to 64 input elements within 50 minutes, which is 631% faster than the maximum leakage rate achieved in the Hertzbleed attack.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"69-72"},"PeriodicalIF":1.4,"publicationDate":"2025-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143688117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A Data Prefetcher-Based 1000-Core RISC-V Processor for Efficient Processing of Graph Neural Networks
IF 1.4 · CAS Q3 · Computer Science
IEEE Computer Architecture Letters Pub Date : 2025-02-26 DOI: 10.1109/LCA.2025.3545799
Omer Khan
{"title":"A Data Prefetcher-Based 1000-Core RISC-V Processor for Efficient Processing of Graph Neural Networks","authors":"Omer Khan","doi":"10.1109/LCA.2025.3545799","DOIUrl":"https://doi.org/10.1109/LCA.2025.3545799","url":null,"abstract":"Graphs-based neural networks have seen tremendous adoption to perform complex predictive analytics on massive real-world graphs. The trend in hardware acceleration has identified significant challenges with harnessing graph locality and workload imbalance due to ultra-sparse and irregular matrix computations at a massively parallel scale. State-of-the-art hardware accelerators utilize massive multithreading and asynchronous execution in GPUs to achieve parallel performance at high power consumption. This paper aims to bridge the power-performance gap using the energy efficiency-centric RISC-V ecosystem. A 1000-core RISC-V processor is proposed to unlock massive parallelism in the graphs-based matrix operators to achieve a low-latency data access paradigm in hardware to achieve robust power-performance scaling. Each core implements a single-threaded pipeline with a novel graph-aware data prefetcher at the 1000 cores scale to deliver an average 20× performance per watt advantage over state-of-the-art NVIDIA GPU.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"73-76"},"PeriodicalIF":1.4,"publicationDate":"2025-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143698183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
QuArch: A Question-Answering Dataset for AI Agents in Computer Architecture
IF 1.4 · CAS Q3 · Computer Science
IEEE Computer Architecture Letters Pub Date : 2025-02-26 DOI: 10.1109/LCA.2025.3541961
Shvetank Prakash;Andrew Cheng;Jason Yik;Arya Tschand;Radhika Ghosal;Ikechukwu Uchendu;Jessica Quaye;Jeffrey Ma;Shreyas Grampurohit;Sofia Giannuzzi;Arnav Balyan;Fin Amin;Aadya Pipersenia;Yash Choudhary;Ankita Nayak;Amir Yazdanbakhsh;Vijay Janapa Reddi
{"title":"QuArch: A Question-Answering Dataset for AI Agents in Computer Architecture","authors":"Shvetank Prakash;Andrew Cheng;Jason Yik;Arya Tschand;Radhika Ghosal;Ikechukwu Uchendu;Jessica Quaye;Jeffrey Ma;Shreyas Grampurohit;Sofia Giannuzzi;Arnav Balyan;Fin Amin;Aadya Pipersenia;Yash Choudhary;Ankita Nayak;Amir Yazdanbakhsh;Vijay Janapa Reddi","doi":"10.1109/LCA.2025.3541961","DOIUrl":"https://doi.org/10.1109/LCA.2025.3541961","url":null,"abstract":"We introduce QuArch, a dataset of 1500 human-validated question-answer pairs designed to evaluate and enhance language models’ understanding of computer architecture. The dataset covers areas including processor design, memory systems, and performance optimization. Our analysis highlights a significant performance gap: the best closed-source model achieves 84% accuracy, while the top small open-source model reaches 72%. We observe notable struggles on QAs regarding memory systems and interconnection networks. Fine-tuning with QuArch improves small model accuracy by up to 8%, establishing a foundation for advancing AI-driven computer architecture research. The dataset and the leaderboard are accessible at <uri>https://quarch.ai/</uri>.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"105-108"},"PeriodicalIF":1.4,"publicationDate":"2025-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143848843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0