{"title":"NoHammer: Preventing Row Hammer With Last-Level Cache Management","authors":"Seunghak Lee;Ki-Dong Kang;Gyeongseo Park;Nam Sung Kim;Daehoon Kim","doi":"10.1109/LCA.2023.3320670","DOIUrl":"https://doi.org/10.1109/LCA.2023.3320670","url":null,"abstract":"Row Hammer (RH) is a circuit-level phenomenon where repetitive activation of a DRAM row causes bit-flips in adjacent rows. Prior studies that rely on extra refreshes to mitigate RH vulnerability demonstrate that bit-flips can be prevented effectively. However, its implementation is challenging due to the significant performance degradation and energy overhead caused by the additional extra refresh for the RH mitigation. To overcome challenges, some studies propose techniques to mitigate the RH attack without relying on extra refresh. These techniques include delaying the activation of an aggressor row for a certain amount of time or swapping an aggressor row with another row to isolate it from victim rows. Although such techniques do not require extra refreshes to mitigate RH, the activation delaying technique may result in high-performance degradation in false-positive cases, and the swapping technique requires high storage overheads to track swap information. We propose \u0000<monospace>NoHammer</monospace>\u0000, an efficient RH mitigation technique to prevent the bit-flips caused by the RH attack by utilizing Last-Level Cache (LLC) management. \u0000<monospace>NoHammer</monospace>\u0000 temporarily extends the associativity of the cache set that is being targeted by utilizing another cache set as the extended set and keeps the cache lines of aggressor rows on the extended set under the eviction-based RH attack. Along with the modification of the LLC replacement policy, \u0000<monospace>NoHammer</monospace>\u0000 ensures that the aggressor row's cache lines are not evicted from the LLC under the RH attack. In our evaluation, we demonstrate that \u0000<monospace>NoHammer</monospace>\u0000 gives 6% higher performance than a baseline without any RH mitigation technique by replacing excessive cache misses caused by the RH attack with LLC hits through sophisticated LLC management, while requiring 45% less storage than prior proposals.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 2","pages":"157-160"},"PeriodicalIF":2.3,"publicationDate":"2023-09-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49962232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hungarian Qubit Assignment for Optimized Mapping of Quantum Circuits on Multi-Core Architectures","authors":"Pau Escofet;Anabel Ovide;Carmen G. Almudever;Eduard Alarcón;Sergi Abadal","doi":"10.1109/LCA.2023.3318857","DOIUrl":"https://doi.org/10.1109/LCA.2023.3318857","url":null,"abstract":"Modular quantum computing architectures offer a promising alternative to monolithic designs for overcoming the scaling limitations of current quantum computers. To achieve scalability beyond small prototypes, quantum architectures are expected to adopt a modular approach, featuring clusters of tightly connected quantum bits with sparser connections between these clusters. Efficiently distributing qubits across multiple processing cores is critical for improving quantum computing systems’ performance and scalability. To address this challenge, we propose the Hungarian Qubit Assignment (HQA) algorithm, which leverages the Hungarian algorithm to improve qubit-to-core assignment. The HQA algorithm considers the interactions between qubits over the entire circuit, enabling fine-grained partitioning and enhanced qubit utilization. We compare the HQA algorithm with state-of-the-art alternatives through comprehensive experiments using both real-world quantum algorithms and random quantum circuits. The results demonstrate the superiority of our proposed approach, outperforming existing methods, with an average improvement of 1.28×.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 2","pages":"161-164"},"PeriodicalIF":2.3,"publicationDate":"2023-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49993040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware-Assisted Code-Pointer Tagging for Forward-Edge Control-Flow Integrity","authors":"Yonghae Kim;Anurag Kar;Jaewon Lee;Jaekyu Lee;Hyesoon Kim","doi":"10.1109/LCA.2023.3306326","DOIUrl":"https://doi.org/10.1109/LCA.2023.3306326","url":null,"abstract":"Software attacks typically operate by overwriting control data, such as a return address and a function pointer, and hijacking the control flow of a program. To prevent such attacks, a number of control-flow integrity (CFI) solutions have been proposed. Nevertheless, most prior work finds difficulties in serving two ends: performance and security. In particular, protecting forward edges, i.e., indirect calls, remains challenging to solve without trading off one for another. In this work, we propose Code-Pointer Tagging (CPT), a novel dynamic CFI solution combined with cryptographic protection. Our key observation is that a pointer's message authentication code (MAC) can be associated with the pointer's CFI label used for CFI checks. We find that such an approach not only enables a space-efficient control-flow graph (CFG) storage but also achieves highly-efficient CFI checks performed along with implicit pointer authentication. To enable CPT, we implement lightweight compiler and hardware support. We prototype our design in an FPGA-accelerated RISC-V hardware simulation platform and conduct full-system-level evaluations. Our results show that CPT incurs a 1.2% average slowdown on the SPEC CPU C/C++ benchmarks while providing effective layered hardening on forward-edge CFI.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 2","pages":"117-120"},"PeriodicalIF":2.3,"publicationDate":"2023-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49988597","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast Performance Prediction for Efficient Distributed DNN Training","authors":"Yugyoung Yun;Eunhyeok Park","doi":"10.1109/LCA.2023.3316452","DOIUrl":"https://doi.org/10.1109/LCA.2023.3316452","url":null,"abstract":"Training large-scale DNN models requires parallel distributed training using hyper-scale systems. To make the best use of the numerous accelerators, it is essential to intelligently combine different parallelization schemes. However, as the size of DNN models increases, the possible combinations of schemes become enormous, and consequently, finding the optimal parallel plan becomes exceedingly expensive and practically unfeasible. In this letter, we introduce a novel cost model, the Markovian Performance Estimator (MPE). This model provides affordable estimates of the throughput of various parallel plans, promoting efficient and fast searches for the ideal parallel plan, even when resources are limited. Significantly, this work is pioneering in explaining the expensive nature of searching for an optimal plan and addressing it using intuitive performance estimations based on real device evaluations. Our experiments demonstrate the effectiveness of the MPE, revealing that it accelerates the optimization process up to 126x faster (36.4 on average) than the existing state-of-the-art baseline, Alpa.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 2","pages":"133-136"},"PeriodicalIF":2.3,"publicationDate":"2023-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49993041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Balancing Performance Against Cost and Sustainability in Multi-Chip-Module GPUs","authors":"Shiqing Zhang;Mahmood Naderan-Tahan;Magnus Jahre;Lieven Eeckhout","doi":"10.1109/LCA.2023.3313203","DOIUrl":"https://doi.org/10.1109/LCA.2023.3313203","url":null,"abstract":"MCM-GPUs scale performance by integrating multiple chiplets within the same package. How to partition the aggregate compute resources across chiplets poses a fundamental trade-off in performance versus cost and sustainability. We propose the \u0000<italic>Performance Per Wafer (PPW)</i>\u0000 metric to explore this trade-off and we find that while performance is maximized with few large chiplets, and while cost and environmental footprint is minimized with many small chiplets, the optimum balance is achieved with a moderate number of medium-sized chiplets. The optimum number of chiplets depends on the workload and increases with increased inter-chiplet bandwidth.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 2","pages":"145-148"},"PeriodicalIF":2.3,"publicationDate":"2023-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49962231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LV: Latency-Versatile Floating-Point Engine for High-Performance Deep Neural Networks","authors":"Yun-Chen Lo;Yu-Chih Tsai;Ren-Shuo Liu","doi":"10.1109/LCA.2023.3287096","DOIUrl":"10.1109/LCA.2023.3287096","url":null,"abstract":"Computing latency is an important system metric for Deep Neural Networks (DNNs) accelerators. To reduce latency, this work proposes \u0000<bold>LV</b>\u0000, a latency-versatile floating-point engine (FP-PE), which contains the following key contributions: 1) an approximate bit-versatile multiplier-and-accumulate (BV-MAC) unit with early shifter and 2) an on-demand fixed-point-to-floating-point conversion (FXP2FP) unit. The extensive experimental results show that LV outperforms baseline FP-PE and redundancy-aware FP-PE by up to 2.12× and 1.3× speedup using TSMC 40-nm technology, achieving comparable accuracy on the ImageNet classification tasks.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 2","pages":"125-128"},"PeriodicalIF":2.3,"publicationDate":"2023-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44362022","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Flexible Embedding-Aware Near Memory Processing Architecture for Recommendation System","authors":"Lingfei Lu;Yudi Qiu;Shiyan Yi;Yibo Fan","doi":"10.1109/LCA.2023.3305668","DOIUrl":"10.1109/LCA.2023.3305668","url":null,"abstract":"Personalized recommendation system (RS) is widely used in the industrial community and occupies much time in AI computing centers. A critical component of RS is the embedding layer, which consists of sparse embedding lookups and is memory-bounded. Recent works have proposed near-memory processing (NMP) architectures to utilize high inner-memory bandwidth to speed up embedding lookups. These NMP works divide embedding vectors either horizontally or vertically. However, the effectiveness of horizontal or vertical partitioning is hard to guarantee under different memory configurations or embedding vector sizes. To improve this issue, we propose FeaNMP, a \u0000<underline>f</u>\u0000lexible \u0000<underline>e</u>\u0000mbedding-\u0000<underline>a</u>\u0000ware \u0000<underline>NMP</u>\u0000 architecture that accelerates the inference phase of RS. We explore different partitioning strategies in detail and design a flexible way to select optimal ones depending on different embedding dimensions and DDR configurations. As a result, compared to the state-of-the-art rank-level NMP work RecNMP, our work achieves up to 11.1× speedup for embedding layers under mix-dimensioned workloads.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 2","pages":"165-168"},"PeriodicalIF":2.3,"publicationDate":"2023-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136139267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unleashing the Potential of PIM: Accelerating Large Batched Inference of Transformer-Based Generative Models","authors":"Jaewan Choi;Jaehyun Park;Kwanhee Kyung;Nam Sung Kim;Jung Ho Ahn","doi":"10.1109/LCA.2023.3305386","DOIUrl":"10.1109/LCA.2023.3305386","url":null,"abstract":"Transformer-based generative models, such as GPT, summarize an input sequence by generating key/value (KV) matrices through attention and generate the corresponding output sequence by utilizing these matrices once per token of the sequence. Both input and output sequences tend to get longer, which improves the understanding of contexts and conversation quality. These models are also typically batched for inference to improve the serving throughput. All these trends enable the models’ weights to be reused effectively, increasing the relative importance of sequence generation, especially in processing KV matrices through attention. We identify that the conventional computing platforms (e.g., GPUs) are not efficient at handling this attention part for inference because each request generates different KV matrices, it has a low operation per byte ratio regardless of the batch size, and the aggregate size of the KV matrices can even surpass that of the entire model weights. This motivates us to propose AttAcc, which exploits the fact that the KV matrices are written once during summarization but used many times (proportional to the output sequence length), each multiplied by the embedding vector corresponding to an output token. The volume of data entering/leaving AttAcc could be more than orders of magnitude smaller than what should be read internally for attention. We design AttAcc with multiple processing-in-memory devices, each multiplying the embedding vector with the portion of the KV matrices within the devices, saving external (inter-device) bandwidth and energy consumption.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 2","pages":"113-116"},"PeriodicalIF":2.3,"publicationDate":"2023-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/10208/10189818/10218731.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49570973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Characterizing and Understanding Defense Methods for GNNs on GPUs","authors":"Meng Wu;Mingyu Yan;Xiaocheng Yang;Wenming Li;Zhimin Zhang;Xiaochun Ye;Dongrui Fan","doi":"10.1109/LCA.2023.3304638","DOIUrl":"10.1109/LCA.2023.3304638","url":null,"abstract":"Graph neural networks (GNNs) are widely deployed in many vital fields, but suffer from adversarial attacks, which seriously compromise the security in these fields. Plenty of defense methods have been proposed to mitigate the impact of these attacks, however, they have introduced extra time-consuming stages into the execution of GNNs. These extra stages need to be accelerated because the end-to-end acceleration is essential for GNNs to achieve fast development and deployment. To disclose the performance bottlenecks, execution patterns, execution semantics, and overheads of the defense methods for GNNs, we characterize and explore these extra stages on GPUs. Given the characterization and exploration, we provide several useful guidelines for both software and hardware optimizations to accelerate the defense methods for GNNs.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 2","pages":"137-140"},"PeriodicalIF":2.3,"publicationDate":"2023-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44243157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"By-Software Branch Prediction in Loops","authors":"Maziar Goudarzi;Reza Azimi;Julian Humecki;Faizaan Rehman;Richard Zhang;Chirag Sethi;Tanishq Bomman;Yuqi Yang","doi":"10.1109/LCA.2023.3304613","DOIUrl":"https://doi.org/10.1109/LCA.2023.3304613","url":null,"abstract":"Load-Dependent Branches (LDB) often do not exhibit regular patterns in their local or global history and thus are inherently hard to predict correctly by conventional branch predictors. We propose a software-to-hardware branch pre-resolution mechanism that allows software to pass branch outcomes to the processor frontend ahead of fetching the branch instruction. A compiler pass identifies the instruction chain leading to the branch (the branch \u0000<italic>backslice</i>\u0000) and generates the pre-execute code that produces the branch outcomes ahead of the frontend observing them. The loop structure helps to unambiguously map the branch outcomes to their corresponding dynamic instances of the branch instruction. Our approach also allows for covering the loop iteration space selectively, with arbitrarily complex patterns. Our method for pre-execution enables important optimizations such as unrolling and vectorization, in order to substantially reduce the pre-execution overhead. Experimental results on select workloads from SPEC CPU 2017 and graph analytics workloads show up to 95% reduction of MPKI (21% on average), up to 39% speedup (7% on average), and 23% IPC gain on average, compared to a core with TAGE-SC-L-64KB branch predictor.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 2","pages":"129-132"},"PeriodicalIF":2.3,"publicationDate":"2023-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49993042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}