IEEE Computer Architecture Letters最新文献_第3页

Thor: A Non-Speculative Value Dependent Timing Side Channel Attack Exploiting Intel AMX 雷神：利用英特尔AMX的非投机值依赖定时侧信道攻击

IF 1.4 3区计算机科学

IEEE Computer Architecture Letters Pub Date : 2025-02-27 DOI: 10.1109/LCA.2025.3544989

Farshad Dizani;Azam Ghanbari;Joshua Kalyanapu;Darsh Asher;Samira Mirbagher Ajorpaz

{"title":"Thor: A Non-Speculative Value Dependent Timing Side Channel Attack Exploiting Intel AMX","authors":"Farshad Dizani;Azam Ghanbari;Joshua Kalyanapu;Darsh Asher;Samira Mirbagher Ajorpaz","doi":"10.1109/LCA.2025.3544989","DOIUrl":"https://doi.org/10.1109/LCA.2025.3544989","url":null,"abstract":"The rise of on-chip accelerators signifies a major shift in computing, driven by the growing demands of artificial intelligence (AI) and specialized applications. These accelerators have gained popularity due to their ability to substantially boost performance, cut energy usage, lower total cost of ownership (TCO), and promote sustainability. Intel's Advanced Matrix Extensions (AMX) is one such on-chip accelerator, specifically designed for handling tasks involving large matrix multiplications commonly used in machine learning (ML) models, image processing, and other computational-heavy operations. In this paper, we introduce a novel value-dependent timing side-channel vulnerability in Intel AMX. By exploiting this weakness, we demonstrate a software-based, value-dependent timing side-channel attack capable of inferring the sparsity of neural network weights without requiring any knowledge of the confidence score, privileged access or physical proximity. Our attack method can fully recover the sparsity of weights assigned to 64 input elements within 50 minutes, which is 631% faster than the maximum leakage rate achieved in the Hertzbleed attack.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"69-72"},"PeriodicalIF":1.4,"publicationDate":"2025-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143688117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

A Data Prefetcher-Based 1000-Core RISC-V Processor for Efficient Processing of Graph Neural Networks 基于数据预取器的1000核RISC-V处理器高效处理图神经网络

IF 1.4 3区计算机科学

IEEE Computer Architecture Letters Pub Date : 2025-02-26 DOI: 10.1109/LCA.2025.3545799

Omer Khan

引用次数: 0

QuArch: A Question-Answering Dataset for AI Agents in Computer Architecture QuArch：计算机体系结构中AI智能体的问答数据集

IF 1.4 3区计算机科学

IEEE Computer Architecture Letters Pub Date : 2025-02-26 DOI: 10.1109/LCA.2025.3541961

Shvetank Prakash;Andrew Cheng;Jason Yik;Arya Tschand;Radhika Ghosal;Ikechukwu Uchendu;Jessica Quaye;Jeffrey Ma;Shreyas Grampurohit;Sofia Giannuzzi;Arnav Balyan;Fin Amin;Aadya Pipersenia;Yash Choudhary;Ankita Nayak;Amir Yazdanbakhsh;Vijay Janapa Reddi

引用次数: 0

A DSP-Based Precision-Scalable MAC With Hybrid Dataflow for Arbitrary-Basis-Quantization CNN Accelerator 基于dsp的高精度可扩展MAC混合数据流用于任意基量化CNN加速器

IF 1.4 3区计算机科学

IEEE Computer Architecture Letters Pub Date : 2025-02-24 DOI: 10.1109/LCA.2025.3545145

Yuanmiao Lin;Shansen Fu;Xueming Li;Chaoming Yang;Rongfeng Li;Hongmin Huang;Xianghong Hu;Shuting Cai;Xiaoming Xiong

{"title":"A DSP-Based Precision-Scalable MAC With Hybrid Dataflow for Arbitrary-Basis-Quantization CNN Accelerator","authors":"Yuanmiao Lin;Shansen Fu;Xueming Li;Chaoming Yang;Rongfeng Li;Hongmin Huang;Xianghong Hu;Shuting Cai;Xiaoming Xiong","doi":"10.1109/LCA.2025.3545145","DOIUrl":"https://doi.org/10.1109/LCA.2025.3545145","url":null,"abstract":"Precision-scalable convolutional neural networks (CNNs) offer a promising solution to balance network accuracy and hardware efficiency, facilitating high-performance execution on embedded devices. However, the requirement for small fine-grained multiplication calculations in precision-scalable (PS) networks has resulted in limited exploration on FPGA platforms. It is found that the deployment of PS accelerators encounters the following challenges: LUT-based multiply-accumulates (MACs) fail to make full use of DSP, and DSP-based MACs support limited precision combinations and cannot efficiently utilize DSP. Therefore, this brief proposes a DSP-based precision-scalable MAC with hybrid dataflow that supports most precision combinations and ensures high-efficiency utilization of DSP and LUT resources. Evaluating on mixed 4 b/8b VGG16, compared with 8b baseline, the proposed accelerator achieves 3.97× improvement in performance with only a 0.37% accuracy degradation. Additionally, compared with state-of-the-art accelerators, the proposed accelerator achieves 1.20 × −2.69× improvement in DSP efficiency and 1.63 × −6.34× improvement in LUT efficiency.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"65-68"},"PeriodicalIF":1.4,"publicationDate":"2025-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143688047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Optically Connected Multi-Stack HBM Modules for Large Language Model Training and Inference 用于大型语言模型训练和推理的光连接多栈HBM模块

IF 1.4 3区计算机科学

IEEE Computer Architecture Letters Pub Date : 2025-02-18 DOI: 10.1109/LCA.2025.3540058

Yanghui Ou;Hengrui Zhang;Austin Rovinski;David Wentzlaff;Christopher Batten

引用次数: 0

Cost-Effective Extension of DRAM-PIM for Group-Wise LLM Quantization 成组LLM量化中DRAM-PIM的成本效益扩展

IF 1.4 3区计算机科学

IEEE Computer Architecture Letters Pub Date : 2025-02-13 DOI: 10.1109/LCA.2025.3532682

Byeori Kim;Changhun Lee;Gwangsun Kim;Eunhyeok Park

{"title":"Cost-Effective Extension of DRAM-PIM for Group-Wise LLM Quantization","authors":"Byeori Kim;Changhun Lee;Gwangsun Kim;Eunhyeok Park","doi":"10.1109/LCA.2025.3532682","DOIUrl":"https://doi.org/10.1109/LCA.2025.3532682","url":null,"abstract":"Processing-in-Memory (PIM) is emerging as a promising next-generation hardware to address memory bottlenecks in large language model (LLM) inference by leveraging internal memory bandwidth, enabling more energy-efficient on-device AI. However, LLMs’ large footprint poses significant challenges for accelerating them on PIM due to limited available space. Recent advances in weight-only quantization, especially group-wise weight quantization (GWQ), reduce LLM model sizes, enabling parameters to be stored at 4-bit precision or lower with minimal accuracy loss. Despite this, current PIM architectures experience performance degradation when handling the additional computations required for quantized weights. While incorporating extra logic could mitigate this degradation, it is often prohibitively expensive due to the constraints of memory technology, necessitating solutions with minimal area overhead. This work introduces two key innovations: 1) scale cascading, and 2) an INT2FP converter, to support GWQ-applied LLMs on PIM with minimal dequantization latency and area overhead compared to FP16 GEMV. Experimental results show that the proposed approach adds less than 0.6% area overhead to the existing PIM unit and achieves a 7% latency overhead for dequantization and GEMV in 4-bit GWQ with a group size of 128, compared to FP16 GEMV, while offering a 1.55× performance gain over baseline dequantization.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"53-56"},"PeriodicalIF":1.4,"publicationDate":"2025-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10886951","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143553436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Comprehensive Design Space Exploration for Graph Neural Network Aggregation on GPUs 图形处理器上图形神经网络聚合的综合设计空间探索

IF 1.4 3区计算机科学

IEEE Computer Architecture Letters Pub Date : 2025-02-06 DOI: 10.1109/LCA.2025.3539371

Hyunwoo Nam;Jay Hwan Lee;Shinhyung Yang;Yeonsoo Kim;Jiun Jeong;Jeonggeun Kim;Bernd Burgstaller

{"title":"Comprehensive Design Space Exploration for Graph Neural Network Aggregation on GPUs","authors":"Hyunwoo Nam;Jay Hwan Lee;Shinhyung Yang;Yeonsoo Kim;Jiun Jeong;Jeonggeun Kim;Bernd Burgstaller","doi":"10.1109/LCA.2025.3539371","DOIUrl":"https://doi.org/10.1109/LCA.2025.3539371","url":null,"abstract":"Graph neural networks (GNNs) have become the state-of-the-art technology for extracting and predicting data representations on graphs. With increasing demand to accelerate GNN computations, the GPU has become the dominant platform for GNN training and inference. GNNs consist of a compute-bound combination phase and a memory-bound aggregation phase. The memory access patterns of the aggregation phase remain a major performance bottleneck on GPUs, despite recent microarchitectural enhancements. Although GNN characterizations have been conducted to investigate this bottleneck, they did not reveal the impact of architectural modifications. However, a comprehensive understanding of improvements from such modifications is imperative to devise GPU optimizations for the aggregation phase. In this letter, we explore the GPU design space for aggregation by assessing the performance improvement potential of a series of architectural modifications. We find that the low locality of aggregation deteriorates performance with increased thread-level parallelism, and a significant enhancement follows memory access optimizations, which remain effective even with software optimization. Our analysis provides insights for hardware optimizations to significantly improve GNN aggregation on GPUs.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"45-48"},"PeriodicalIF":1.4,"publicationDate":"2025-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143480835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Security Helper Chiplets: A New Paradigm for Secure Hardware Monitoring 安全辅助小芯片：安全硬件监控的新范例

IF 1.4 3区计算机科学

IEEE Computer Architecture Letters Pub Date : 2025-02-06 DOI: 10.1109/LCA.2025.3539282

Pooya Aghanoury;Santosh Ghosh;Nader Sehatbakhsh

引用次数: 0

RoPIM: A Processing-in-Memory Architecture for Accelerating Rotary Positional Embedding in Transformer Models RoPIM：一种加速变压器模型旋转位置嵌入的内存处理架构

IF 1.4 3区计算机科学

IEEE Computer Architecture Letters Pub Date : 2025-01-28 DOI: 10.1109/LCA.2025.3535470

Yunhyeong Jeon;Minwoo Jang;Hwanjun Lee;Yeji Jung;Jin Jung;Jonggeon Lee;Jinin So;Daehoon Kim

{"title":"RoPIM: A Processing-in-Memory Architecture for Accelerating Rotary Positional Embedding in Transformer Models","authors":"Yunhyeong Jeon;Minwoo Jang;Hwanjun Lee;Yeji Jung;Jin Jung;Jonggeon Lee;Jinin So;Daehoon Kim","doi":"10.1109/LCA.2025.3535470","DOIUrl":"https://doi.org/10.1109/LCA.2025.3535470","url":null,"abstract":"The emergence of attention-based Transformer models, such as GPT, BERT, and LLaMA, has revolutionized Natural Language Processing (NLP) by significantly improving performance across a wide range of applications. A critical factor driving these improvements is the use of positional embeddings, which are crucial for capturing the contextual relationships between tokens in a sequence. However, current positional embedding methods face challenges, particularly in managing performance overhead for long sequences and effectively capturing relationships between adjacent tokens. In response, Rotary Positional Embedding (RoPE) has emerged as a method that effectively embeds positional information with high accuracy and without necessitating model retraining even with long sequences. Despite its effectiveness, RoPE introduces a considerable performance bottleneck during inference. We observe that RoPE accounts for 61% of GPU execution time due to extensive data movement and execution dependencies. In this paper, we introduce <monospace>RoPIM</monospace>, a Processing-In-Memory (PIM) architecture designed to efficiently accelerate RoPE operations in Transformer models. <monospace>RoPIM</monospace> achieves this by utilizing a bank-level accelerator that reduces off-chip data movement through in-accelerator support for multiply-addition operations and minimizes operational dependencies via parallel data rearrangement. Additionally, <monospace>RoPIM</monospace> proposes an optimized data mapping strategy that leverages both bank-level and row-level mappings to enable parallel execution, eliminate bank-to-bank communication, and reduce DRAM activations. Our experimental results show that <monospace>RoPIM</monospace> achieves up to a 307.9× performance improvement and 914.1× energy savings compared to conventional systems.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"41-44"},"PeriodicalIF":1.4,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143455148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Editorial: A Letter From the Editor-in-Chief of IEEE Computer Architecture Letters 社论：一封来自IEEE计算机体系结构快报总编辑的信

IF 1.4 3区计算机科学

IEEE Computer Architecture Letters Pub Date : 2025-01-28 DOI: 10.1109/LCA.2025.3528276

Sudhanva Gurumurthi;Mattan Erez

引用次数: 0