{"title":"Thor: A Non-Speculative Value Dependent Timing Side Channel Attack Exploiting Intel AMX","authors":"Farshad Dizani;Azam Ghanbari;Joshua Kalyanapu;Darsh Asher;Samira Mirbagher Ajorpaz","doi":"10.1109/LCA.2025.3544989","DOIUrl":"https://doi.org/10.1109/LCA.2025.3544989","url":null,"abstract":"The rise of on-chip accelerators signifies a major shift in computing, driven by the growing demands of artificial intelligence (AI) and specialized applications. These accelerators have gained popularity due to their ability to substantially boost performance, cut energy usage, lower total cost of ownership (TCO), and promote sustainability. Intel's Advanced Matrix Extensions (AMX) is one such on-chip accelerator, specifically designed for handling tasks involving large matrix multiplications commonly used in machine learning (ML) models, image processing, and other computational-heavy operations. In this paper, we introduce a novel value-dependent timing side-channel vulnerability in Intel AMX. By exploiting this weakness, we demonstrate a software-based, value-dependent timing side-channel attack capable of inferring the sparsity of neural network weights without requiring any knowledge of the confidence score, privileged access or physical proximity. Our attack method can fully recover the sparsity of weights assigned to 64 input elements within 50 minutes, which is 631% faster than the maximum leakage rate achieved in the Hertzbleed attack.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"69-72"},"PeriodicalIF":1.4,"publicationDate":"2025-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143688117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Data Prefetcher-Based 1000-Core RISC-V Processor for Efficient Processing of Graph Neural Networks","authors":"Omer Khan","doi":"10.1109/LCA.2025.3545799","DOIUrl":"https://doi.org/10.1109/LCA.2025.3545799","url":null,"abstract":"Graphs-based neural networks have seen tremendous adoption to perform complex predictive analytics on massive real-world graphs. The trend in hardware acceleration has identified significant challenges with harnessing graph locality and workload imbalance due to ultra-sparse and irregular matrix computations at a massively parallel scale. State-of-the-art hardware accelerators utilize massive multithreading and asynchronous execution in GPUs to achieve parallel performance at high power consumption. This paper aims to bridge the power-performance gap using the energy efficiency-centric RISC-V ecosystem. A 1000-core RISC-V processor is proposed to unlock massive parallelism in the graphs-based matrix operators to achieve a low-latency data access paradigm in hardware to achieve robust power-performance scaling. Each core implements a single-threaded pipeline with a novel graph-aware data prefetcher at the 1000 cores scale to deliver an average 20× performance per watt advantage over state-of-the-art NVIDIA GPU.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"73-76"},"PeriodicalIF":1.4,"publicationDate":"2025-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143698183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"QuArch: A Question-Answering Dataset for AI Agents in Computer Architecture","authors":"Shvetank Prakash;Andrew Cheng;Jason Yik;Arya Tschand;Radhika Ghosal;Ikechukwu Uchendu;Jessica Quaye;Jeffrey Ma;Shreyas Grampurohit;Sofia Giannuzzi;Arnav Balyan;Fin Amin;Aadya Pipersenia;Yash Choudhary;Ankita Nayak;Amir Yazdanbakhsh;Vijay Janapa Reddi","doi":"10.1109/LCA.2025.3541961","DOIUrl":"https://doi.org/10.1109/LCA.2025.3541961","url":null,"abstract":"We introduce QuArch, a dataset of 1500 human-validated question-answer pairs designed to evaluate and enhance language models’ understanding of computer architecture. The dataset covers areas including processor design, memory systems, and performance optimization. Our analysis highlights a significant performance gap: the best closed-source model achieves 84% accuracy, while the top small open-source model reaches 72%. We observe notable struggles on QAs regarding memory systems and interconnection networks. Fine-tuning with QuArch improves small model accuracy by up to 8%, establishing a foundation for advancing AI-driven computer architecture research. The dataset and the leaderboard are accessible at <uri>https://quarch.ai/</uri>.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"105-108"},"PeriodicalIF":1.4,"publicationDate":"2025-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143848843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A DSP-Based Precision-Scalable MAC With Hybrid Dataflow for Arbitrary-Basis-Quantization CNN Accelerator","authors":"Yuanmiao Lin;Shansen Fu;Xueming Li;Chaoming Yang;Rongfeng Li;Hongmin Huang;Xianghong Hu;Shuting Cai;Xiaoming Xiong","doi":"10.1109/LCA.2025.3545145","DOIUrl":"https://doi.org/10.1109/LCA.2025.3545145","url":null,"abstract":"Precision-scalable convolutional neural networks (CNNs) offer a promising solution to balance network accuracy and hardware efficiency, facilitating high-performance execution on embedded devices. However, the requirement for small fine-grained multiplication calculations in precision-scalable (PS) networks has resulted in limited exploration on FPGA platforms. It is found that the deployment of PS accelerators encounters the following challenges: LUT-based multiply-accumulates (MACs) fail to make full use of DSP, and DSP-based MACs support limited precision combinations and cannot efficiently utilize DSP. Therefore, this brief proposes a DSP-based precision-scalable MAC with hybrid dataflow that supports most precision combinations and ensures high-efficiency utilization of DSP and LUT resources. Evaluating on mixed 4 b/8b VGG16, compared with 8b baseline, the proposed accelerator achieves 3.97× improvement in performance with only a 0.37% accuracy degradation. Additionally, compared with state-of-the-art accelerators, the proposed accelerator achieves 1.20 × −2.69× improvement in DSP efficiency and 1.63 × −6.34× improvement in LUT efficiency.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"65-68"},"PeriodicalIF":1.4,"publicationDate":"2025-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143688047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optically Connected Multi-Stack HBM Modules for Large Language Model Training and Inference","authors":"Yanghui Ou;Hengrui Zhang;Austin Rovinski;David Wentzlaff;Christopher Batten","doi":"10.1109/LCA.2025.3540058","DOIUrl":"https://doi.org/10.1109/LCA.2025.3540058","url":null,"abstract":"Large language models (LLMs) have grown exponentially in size, presenting significant challenges to traditional memory architectures. Current high bandwidth memory (HBM) systems are constrained by chiplet I/O bandwidth and the limited number of HBM stacks that can be integrated due to packaging constraints. In this letter, we propose a novel memory system architecture that leverages silicon photonic interconnects to increase memory capacity and bandwidth for compute devices. By introducing optically connected multi-stack HBM modules, we extend the HBM memory system off the compute chip, significantly increasing the number of HBM stacks. Our evaluations show that this architecture can improve training efficiency for a trillion-parameter model by 1.4× compared to a modeled A100 baseline, while also enhancing inference performance by 4.2× if the L2 is modified to provide sufficient bandwidth.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"49-52"},"PeriodicalIF":1.4,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143496530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cost-Effective Extension of DRAM-PIM for Group-Wise LLM Quantization","authors":"Byeori Kim;Changhun Lee;Gwangsun Kim;Eunhyeok Park","doi":"10.1109/LCA.2025.3532682","DOIUrl":"https://doi.org/10.1109/LCA.2025.3532682","url":null,"abstract":"Processing-in-Memory (PIM) is emerging as a promising next-generation hardware to address memory bottlenecks in large language model (LLM) inference by leveraging internal memory bandwidth, enabling more energy-efficient on-device AI. However, LLMs’ large footprint poses significant challenges for accelerating them on PIM due to limited available space. Recent advances in weight-only quantization, especially group-wise weight quantization (GWQ), reduce LLM model sizes, enabling parameters to be stored at 4-bit precision or lower with minimal accuracy loss. Despite this, current PIM architectures experience performance degradation when handling the additional computations required for quantized weights. While incorporating extra logic could mitigate this degradation, it is often prohibitively expensive due to the constraints of memory technology, necessitating solutions with minimal area overhead. This work introduces two key innovations: 1) scale cascading, and 2) an INT2FP converter, to support GWQ-applied LLMs on PIM with minimal dequantization latency and area overhead compared to FP16 GEMV. Experimental results show that the proposed approach adds less than 0.6% area overhead to the existing PIM unit and achieves a 7% latency overhead for dequantization and GEMV in 4-bit GWQ with a group size of 128, compared to FP16 GEMV, while offering a 1.55× performance gain over baseline dequantization.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"53-56"},"PeriodicalIF":1.4,"publicationDate":"2025-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10886951","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143553436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comprehensive Design Space Exploration for Graph Neural Network Aggregation on GPUs","authors":"Hyunwoo Nam;Jay Hwan Lee;Shinhyung Yang;Yeonsoo Kim;Jiun Jeong;Jeonggeun Kim;Bernd Burgstaller","doi":"10.1109/LCA.2025.3539371","DOIUrl":"https://doi.org/10.1109/LCA.2025.3539371","url":null,"abstract":"Graph neural networks (GNNs) have become the state-of-the-art technology for extracting and predicting data representations on graphs. With increasing demand to accelerate GNN computations, the GPU has become the dominant platform for GNN training and inference. GNNs consist of a compute-bound combination phase and a memory-bound aggregation phase. The memory access patterns of the aggregation phase remain a major performance bottleneck on GPUs, despite recent microarchitectural enhancements. Although GNN characterizations have been conducted to investigate this bottleneck, they did not reveal the impact of architectural modifications. However, a comprehensive understanding of improvements from such modifications is imperative to devise GPU optimizations for the aggregation phase. In this letter, we explore the GPU design space for aggregation by assessing the performance improvement potential of a series of architectural modifications. We find that the low locality of aggregation deteriorates performance with increased thread-level parallelism, and a significant enhancement follows memory access optimizations, which remain effective even with software optimization. Our analysis provides insights for hardware optimizations to significantly improve GNN aggregation on GPUs.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"45-48"},"PeriodicalIF":1.4,"publicationDate":"2025-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143480835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Security Helper Chiplets: A New Paradigm for Secure Hardware Monitoring","authors":"Pooya Aghanoury;Santosh Ghosh;Nader Sehatbakhsh","doi":"10.1109/LCA.2025.3539282","DOIUrl":"https://doi.org/10.1109/LCA.2025.3539282","url":null,"abstract":"Hardware-assisted security features are a powerful tool for safeguarding computing systems against various attacks. However, integrating hardware security features (<italic>HWSFs</i>) within complex System-on-Chip (SoC) architectures often leads to scalability issues and/or resource competition, impacting metrics such as area and power, ultimately leading to an undesirable trade-off between security and performance. In this study, we propose re-evaluating HWSF design constraints in light of the recent paradigm shift from integrated SoCs to chiplet-based architectures. Specifically, we explore the possibility of leveraging a centralized and versatile security module based on chiplets called <italic>security helper chiplets</i>. We study the <italic>cost</i> implications of using such a model by developing a new framework for cost analysis. Our analysis highlights the cost tradeoffs across different design strategies.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"61-64"},"PeriodicalIF":1.4,"publicationDate":"2025-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143688087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RoPIM: A Processing-in-Memory Architecture for Accelerating Rotary Positional Embedding in Transformer Models","authors":"Yunhyeong Jeon;Minwoo Jang;Hwanjun Lee;Yeji Jung;Jin Jung;Jonggeon Lee;Jinin So;Daehoon Kim","doi":"10.1109/LCA.2025.3535470","DOIUrl":"https://doi.org/10.1109/LCA.2025.3535470","url":null,"abstract":"The emergence of attention-based Transformer models, such as GPT, BERT, and LLaMA, has revolutionized Natural Language Processing (NLP) by significantly improving performance across a wide range of applications. A critical factor driving these improvements is the use of positional embeddings, which are crucial for capturing the contextual relationships between tokens in a sequence. However, current positional embedding methods face challenges, particularly in managing performance overhead for long sequences and effectively capturing relationships between adjacent tokens. In response, Rotary Positional Embedding (RoPE) has emerged as a method that effectively embeds positional information with high accuracy and without necessitating model retraining even with long sequences. Despite its effectiveness, RoPE introduces a considerable performance bottleneck during inference. We observe that RoPE accounts for 61% of GPU execution time due to extensive data movement and execution dependencies. In this paper, we introduce <monospace>RoPIM</monospace>, a Processing-In-Memory (PIM) architecture designed to efficiently accelerate RoPE operations in Transformer models. <monospace>RoPIM</monospace> achieves this by utilizing a bank-level accelerator that reduces off-chip data movement through in-accelerator support for multiply-addition operations and minimizes operational dependencies via parallel data rearrangement. Additionally, <monospace>RoPIM</monospace> proposes an optimized data mapping strategy that leverages both bank-level and row-level mappings to enable parallel execution, eliminate bank-to-bank communication, and reduce DRAM activations. Our experimental results show that <monospace>RoPIM</monospace> achieves up to a 307.9× performance improvement and 914.1× energy savings compared to conventional systems.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"41-44"},"PeriodicalIF":1.4,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143455148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Editorial: A Letter From the Editor-in-Chief of IEEE Computer Architecture Letters","authors":"Sudhanva Gurumurthi;Mattan Erez","doi":"10.1109/LCA.2025.3528276","DOIUrl":"https://doi.org/10.1109/LCA.2025.3528276","url":null,"abstract":"","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"iii-iv"},"PeriodicalIF":1.4,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10856691","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143105557","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}