IEEE Computer Architecture Letters: Latest Articles

Accelerating Programmable Bootstrapping Targeting Contemporary GPU Microarchitecture
IF 1.4, Zone 3, Computer Science
IEEE Computer Architecture Letters, vol. 23, no. 2, pp. 207-210. Pub Date: 2024-06-24. DOI: 10.1109/LCA.2024.3418448
Hyesung Ji;Sangpyo Kim;Jaewan Choi;Jung Ho Ahn
Abstract: Fully homomorphic encryption (FHE) enables computation on encrypted data without privacy leakage. Among FHE schemes, GSW-based schemes are notable for supporting the evaluation of arbitrary univariate functions using programmable bootstrapping (PBS). Despite their wide applicability, the computational complexity of a single PBS impedes widespread adoption. However, at the application level, there are enough independent PBSs to achieve high data-level parallelism, making them suitable for running on GPUs, which are known for their high parallel computing capability. On contemporary GPUs, peak integer performance has steadily increased, and the sizes of L2 cache and shared memory have also grown rapidly since the Volta architecture. Prior attempts to accelerate PBS on GPUs have fallen short because their outdated implementations cannot leverage recent GPU advances. In this paper, we introduce a GPU implementation that supports the latest PBS algorithm and incorporates GPU-trend-aware optimizations. Our implementation achieves a 10.8× performance improvement over the state-of-the-art (SOTA) GPU implementations on RTX 4090 and even outperforms the SOTA ASIC implementation.
PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10570278
Citations: 0
TeleVM: A Lightweight Virtual Machine for RISC-V Architecture
IF 2.3, Zone 3, Computer Science
IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 121-124. Pub Date: 2024-04-30. DOI: 10.1109/LCA.2024.3394835
Tianzheng Li;Enfang Cui;Yuting Wu;Qian Wei;Yue Gao
Abstract: Serverless computing has become an important paradigm in cloud computing due to advantages such as fast large-scale deployment and a pay-as-you-go charging model. Because of shared infrastructure and multi-tenant environments, serverless applications have high security requirements. Traditional virtual machines and containers cannot fully meet the requirements of serverless applications. Therefore, lightweight virtual machine technology has emerged, which can reduce overhead and boot time while ensuring security. In this letter, we propose TeleVM, a lightweight virtual machine for RISC-V architecture. TeleVM achieves strong isolation through the hypervisor extension of RISC-V. Compared with traditional virtual machines, TeleVM implements only a small number of I/O devices and functions, which effectively reduces memory overhead and boot time. We compared TeleVM and QEMU+KVM through experiments. Compared to QEMU+KVM, the boot time and memory overhead of TeleVM are 74% and 90% lower, respectively. This work further improves the cloud computing software ecosystem of RISC-V architecture and promotes the use of RISC-V architecture in cloud computing scenarios.
Citations: 0
Analysis of Data Transfer Bottlenecks in Commercial PIM Systems: A Study With UPMEM-PIM
IF 1.4, Zone 3, Computer Science
IEEE Computer Architecture Letters, vol. 23, no. 2, pp. 179-182. Pub Date: 2024-04-12. DOI: 10.1109/LCA.2024.3387472
Dongjae Lee;Bongjoon Hyun;Taehun Kim;Minsoo Rhu
Abstract: Due to emerging workloads that require high memory bandwidth, Processing-in-Memory (PIM) has gained significant attention, leading to the introduction of several industrial PIM products that are integrated with conventional computing systems. This letter characterizes the data transfer overheads between the conventional DRAM address space and the PIM address space within a PIM-integrated system, using the commercialized PIM device made by UPMEM. Our findings highlight the need for optimization in PIM-integrated systems to address these overheads, offering critical insights for future PIM technologies.
Citations: 0
GATe: Streamlining Memory Access and Communication to Accelerate Graph Attention Network With Near-Memory Processing
IF 2.3, Zone 3, Computer Science
IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 87-90. Pub Date: 2024-04-10. DOI: 10.1109/LCA.2024.3386734
Shiyan Yi;Yudi Qiu;Lingfei Lu;Guohao Xu;Yong Gong;Xiaoyang Zeng;Yibo Fan
Abstract: Graph Attention Network (GAT) has gained widespread adoption thanks to its exceptional performance. The critical components of a GAT model involve aggregation and attention, which cause numerous main-memory accesses. Recently, much research has proposed near-memory processing (NMP) architectures to accelerate aggregation. However, graph attention requires additional operations distinct from aggregation, making previous NMP architectures less suitable for supporting GAT. In this paper, we propose GATe, a practical and efficient GAT accelerator with NMP architecture. To the best of our knowledge, this is the first work that accelerates both attention and aggregation computation on DIMM. In the attention and aggregation phases, we unify feature vector access to reduce repetitive memory accesses and refine the computation flow to reduce communication. Furthermore, we introduce a novel sharding method that enhances data reusability. Experiments show that our work achieves substantial speedups of up to 6.77× and 2.46× compared to the state-of-the-art NMP works GNNear and GraNDe, respectively.
Citations: 0
An Area Efficient Architecture of a Novel Chaotic System for High Randomness Security in e-Health
IF 2.3, Zone 3, Computer Science
IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 104-107. Pub Date: 2024-04-10. DOI: 10.1109/LCA.2024.3387352
Kyriaki Tsantikidou;Nicolas Sklavos
Abstract: An e-Health application must be carefully designed, as a malicious attack has ethical and legal consequences. While common cryptography protocols enhance security, they also add high computation overhead. In this letter, an area-efficient architecture of a novel chaotic system for high randomness security is proposed. It consists of the chaotic logistic map and a novel component that efficiently combines it with a block cipher's key generation function. The proposed architecture operates as both a key scheduling/management scheme and a stream cipher. All operations are implemented in an FPGA with appropriate resource utilization techniques. The proposed architecture reduces area consumption by at least 41.5% compared to published cryptography architectures and improves throughput-to-area efficiency by 5.7% compared to published chaotic designs. Finally, it passes all NIST randomness tests, exhibits the avalanche effect, and produces the highest number of random bits from a single seed compared to other published security systems.
Citations: 0
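The chaotic logistic map named in the abstract above is the recurrence x ← r·x·(1 − x). The following is only a minimal software sketch of that recurrence used as a bit source; it is not the authors' FPGA architecture and not a secure cipher. The function name `logistic_map_bits`, the parameter value r = 3.99, the 0.5 threshold, and the burn-in length are all illustrative assumptions, not details from the paper.

```python
from typing import List


def logistic_map_bits(seed: float, r: float = 3.99,
                      n_bits: int = 16, burn_in: int = 100) -> List[int]:
    """Iterate x <- r*x*(1-x) and threshold each state at 0.5 to emit one bit.

    Hypothetical helper for illustration only: r, the threshold and the
    burn-in length are assumptions, not values taken from the paper.
    """
    if not 0.0 < seed < 1.0:
        raise ValueError("seed must lie strictly between 0 and 1")
    x = seed
    # Discard the first iterations so the output depends less directly on the seed.
    for _ in range(burn_in):
        x = r * x * (1.0 - x)
    bits = []
    for _ in range(n_bits):
        x = r * x * (1.0 - x)
        bits.append(1 if x >= 0.5 else 0)
    return bits


if __name__ == "__main__":
    print(logistic_map_bits(0.3141))
```

The same seed always reproduces the same bit sequence, which is the property a seeded key stream relies on; a floating-point sketch like this says nothing about the randomness quality or hardware cost reported in the letter.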
The Importance of Generalizability in Machine Learning for Systems
IF 2.3, Zone 3, Computer Science
IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 95-98. Pub Date: 2024-04-02. DOI: 10.1109/LCA.2024.3384449
Varun Gohil;Sundar Dev;Gaurang Upasani;David Lo;Parthasarathy Ranganathan;Christina Delimitrou
Abstract: Using machine learning (ML) to tackle computer systems tasks is gaining popularity. One of the shortcomings of such ML-based approaches is the inability of models to generalize to out-of-distribution data, i.e., data whose distribution differs from that of the training dataset. We show that this issue exists in cloud environments by analyzing various ML models used to improve resource balance in Google's fleet. We discuss the trade-offs associated with different techniques used to detect out-of-distribution data. Finally, we propose and demonstrate the efficacy of using Bayesian models to detect the model's confidence in its output when used to improve cloud server resource balance.
Citations: 0
MajorK: Majority Based kmer Matching in Commodity DRAM
IF 2.3, Zone 3, Computer Science
IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 83-86. Pub Date: 2024-04-02. DOI: 10.1109/LCA.2024.3384259
Z. Jahshan;L. Yavits
Abstract: Fast parallel search capabilities on large datasets are required across multiple application domains. One such domain is genome analysis, which requires high-performance k-mer matching in large genome databases. Recently proposed solutions implemented k-mer matching in DRAM, utilizing its sheer capacity and parallelism. However, their operation is essentially bit-serial, which ultimately limits performance, especially when matching long strings, as is customary in genome analysis pipelines. The proposed solution, MajorK, enables bit-parallel majority-based k-mer matching in unmodified commodity DRAM. MajorK employs multiple DRAM row activation, where the search patterns (query k-mers) are coded into DRAM addresses. We evaluate MajorK on viral genome k-mer matching and show that it can achieve up to 2.7× higher performance while providing better matching accuracy compared to state-of-the-art DRAM-based k-mer matching accelerators.
Citations: 0
SLO-Aware GPU DVFS for Energy-Efficient LLM Inference Serving
IF 1.4, Zone 3, Computer Science
IEEE Computer Architecture Letters, vol. 23, no. 2, pp. 150-153. Pub Date: 2024-03-28. DOI: 10.1109/LCA.2024.3406038
Andreas Kosmas Kakolyris;Dimosthenis Masouros;Sotirios Xydis;Dimitrios Soudris
Abstract: The increasing popularity of LLM-based chatbots, combined with their reliance on power-hungry GPU infrastructure, forms a critical challenge for providers: minimizing energy consumption under Service-Level Objectives (SLOs) that ensure optimal user experience. Traditional energy optimization methods fall short for LLM inference due to their autoregressive architecture, which renders them incapable of meeting a predefined SLO without energy overprovisioning. This autoregressive nature, however, allows for iteration-level adjustments, enabling continuous fine-tuning of the system throughout the inference process. In this letter, we propose a solution based on iteration-level GPU Dynamic Voltage Frequency Scaling (DVFS), aiming to reduce the energy impact of LLM serving. In real-world tests under varying SLO constraints, this approach shows potential energy savings of more than 22.8% and up to 45.5%. Our approach works on top of existing LLM hosting services and requires minimal profiling and no intervention in the inference serving system.
Citations: 0
Dramaton: A Near-DRAM Accelerator for Large Number Theoretic Transforms
IF 2.3, Zone 3, Computer Science
IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 108-111. Pub Date: 2024-03-27. DOI: 10.1109/LCA.2024.3381452
Yongmo Park;Subhankar Pal;Aporva Amarnath;Karthik Swaminathan;Wei D. Lu;Alper Buyuktosunoglu;Pradip Bose
Abstract: With the rising popularity of post-quantum cryptographic schemes, realizing practical implementations for real-world applications is still a major challenge. A major bottleneck in such schemes is the fetching and processing of large polynomials in the Number Theoretic Transform (NTT), which makes non-von Neumann paradigms, such as near-memory processing, a viable option. We therefore propose a novel near-DRAM NTT accelerator design, called Dramaton. Additionally, we introduce a conflict-free mapping algorithm that enables Dramaton to process large NTTs with minimal hardware overhead using a fixed-permutation network. Dramaton achieves 5–207× speedup in latency over the state-of-the-art and 97× improvement in EDP over a recent near-memory NTT accelerator.
Citations: 0
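For readers unfamiliar with the Number Theoretic Transform mentioned in the abstract above, the sketch below shows the textbook O(n²) definition over a tiny prime field, together with its inverse. It is not Dramaton's algorithm, memory mapping, or permutation network; the modulus 17, the length 4, the root of unity 4, and the function names `ntt` and `intt` are illustrative assumptions chosen so the arithmetic can be checked by hand.

```python
P = 17      # small prime modulus (illustrative only)
OMEGA = 4   # primitive 4th root of unity mod 17: 4**2 % 17 == 16, 4**4 % 17 == 1


def ntt(a, omega=OMEGA, p=P):
    """Naive forward NTT: A[i] = sum_j a[j] * omega^(i*j) mod p."""
    n = len(a)
    return [sum(a[j] * pow(omega, i * j, p) for j in range(n)) % p
            for i in range(n)]


def intt(A, omega=OMEGA, p=P):
    """Naive inverse NTT: a[i] = n^-1 * sum_j A[j] * omega^(-i*j) mod p."""
    n = len(A)
    inv_n = pow(n, p - 2, p)          # modular inverse via Fermat's little theorem
    inv_omega = pow(omega, p - 2, p)  # inverse of the root of unity
    return [(inv_n * sum(A[j] * pow(inv_omega, i * j, p) for j in range(n))) % p
            for i in range(n)]


if __name__ == "__main__":
    coeffs = [1, 2, 3, 4]
    transformed = ntt(coeffs)
    print(transformed, intt(transformed))
```

Practical implementations replace this quadratic loop with O(n log n) butterfly stages, and it is the data movement of those stages on large polynomials that the abstract identifies as the bottleneck.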
Characterizing Machine Learning-Based Runtime Prefetcher Selection
IF 1.4, Zone 3, Computer Science
IEEE Computer Architecture Letters, vol. 23, no. 2, pp. 146-149. Pub Date: 2024-03-27. DOI: 10.1109/LCA.2024.3404887
Erika S. Alcorta;Mahesh Madhav;Richard Afoakwa;Scott Tetrick;Neeraja J. Yadwadkar;Andreas Gerstlauer
Abstract: Modern computer designs support composite prefetching, where multiple prefetcher components are used to target different memory access patterns. However, multiple prefetchers competing for resources can sometimes hurt performance, especially in many-core systems where cache and other resources are limited. Recent work has proposed mitigating this issue by selectively enabling and disabling prefetcher components at runtime. Formulating the problem with machine learning (ML) methods is promising, but efficient and effective solutions in terms of cost and performance are not well understood. This work studies fundamental characteristics of the composite prefetcher selection problem through the lens of ML to inform future prefetcher selection designs. We show that prefetcher decisions do not have significant temporal dependencies, that a phase-based rather than sample-based definition of ground truth yields patterns that are easier to learn, and that prefetcher selection can be formulated as a workload-agnostic problem requiring little to no training at runtime.
Citations: 0