IEEE Computer Architecture Letters最新文献

筛选
英文 中文
Efficient Implementation of Knuth Yao Sampler on Reconfigurable Hardware 在可重构硬件上高效实现 Knuth Yao 采样器
IF 1.4 3区 计算机科学
IEEE Computer Architecture Letters Pub Date : 2024-09-03 DOI: 10.1109/LCA.2024.3454490
Paresh Baidya;Rourab Paul;Swagata Mandal;Sumit Kumar Debnath
{"title":"Efficient Implementation of Knuth Yao Sampler on Reconfigurable Hardware","authors":"Paresh Baidya;Rourab Paul;Swagata Mandal;Sumit Kumar Debnath","doi":"10.1109/LCA.2024.3454490","DOIUrl":"10.1109/LCA.2024.3454490","url":null,"abstract":"Lattice-based cryptography offers a promising alternative to traditional cryptographic schemes due to its resistance against quantum attacks. Discrete Gaussian sampling plays a crucial role in lattice-based cryptographic algorithms such as Ring Learning with error (R-LWE) for generating the coefficient of the polynomials. The Knuth Yao Sampler is a widely used discrete Gaussian sampling technique in Lattice-based cryptography. On the other hand, Lattice based cryptography involves resource intensive complex computation. Due to the presence of inherent parallelism, on field programmability Field Programmable Gate Array (FPGA) based reconfigurable hardware can be a good platform for the implementation of Lattice-based cryptographic algorithms. In this work, an efficient implementation of Knuth Yao Sampler on reconfigurable hardware is proposed that not only reduces the resource utilization but also enhances the speed of the sampling operation. The proposed method reduces look up table (LUT) requirement by almost 29% and enhances the speed by almost 17 times compared to the method proposed by the authors in (Sinha Roy et al., 2014).","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142183928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
SmartQuant: CXL-Based AI Model Store in Support of Runtime Configurable Weight Quantization SmartQuant:基于 CXL 的人工智能模型存储,支持运行时可配置的权重量化
IF 1.4 3区 计算机科学
IEEE Computer Architecture Letters Pub Date : 2024-09-02 DOI: 10.1109/LCA.2024.3452699
Rui Xie;Asad Ul Haq;Linsen Ma;Krystal Sun;Sanchari Sen;Swagath Venkataramani;Liu Liu;Tong Zhang
{"title":"SmartQuant: CXL-Based AI Model Store in Support of Runtime Configurable Weight Quantization","authors":"Rui Xie;Asad Ul Haq;Linsen Ma;Krystal Sun;Sanchari Sen;Swagath Venkataramani;Liu Liu;Tong Zhang","doi":"10.1109/LCA.2024.3452699","DOIUrl":"10.1109/LCA.2024.3452699","url":null,"abstract":"Recent studies have revealed that, during the inference on generative AI models such as transformer, the importance of different weights exhibits substantial context-dependent variations. This naturally manifests a promising potential of adaptively configuring weight quantization to improve the generative AI inference efficiency. Although configurable weight quantization can readily leverage the hardware support of variable-precision arithmetics in modern GPU and AI accelerators, little prior research has studied how one could exploit variable weight quantization to proportionally improve the AI model memory access speed and energy efficiency. Motivated by the rapidly maturing CXL ecosystem, this work develops a CXL-based design solution to fill this gap. The key is to allow CXL memory controllers play an active role in supporting and exploiting runtime configurable weight quantization. Using transformer as a representative generative AI model, we carried out experiments that well demonstrate the effectiveness of the proposed design solution.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142183930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Proactive Embedding on Cold Data for Deep Learning Recommendation Model Training 在冷数据上主动嵌入,用于深度学习推荐模型训练
IF 1.4 3区 计算机科学
IEEE Computer Architecture Letters Pub Date : 2024-08-28 DOI: 10.1109/LCA.2024.3445948
Haeyoon Cho;Hyojun Son;Jungmin Choi;Byungil Koh;Minho Ha;John Kim
{"title":"Proactive Embedding on Cold Data for Deep Learning Recommendation Model Training","authors":"Haeyoon Cho;Hyojun Son;Jungmin Choi;Byungil Koh;Minho Ha;John Kim","doi":"10.1109/LCA.2024.3445948","DOIUrl":"10.1109/LCA.2024.3445948","url":null,"abstract":"Deep learning recommendation model (DLRM) is an important class of deep learning networks that are commonly used in many applications. DRLM presents unique challenges, especially for scale-out training since it not only has compute and memory-intensive components but the communication between the multiple GPUs is also on the critical path. In this work, we propose how \u0000<italic>cold</i>\u0000 data in DLRM embedding tables can be exploited to propose proactive embedding. In particular, proactive embedding allows embedding table accesses to be done in advance to reduce the impact of the memory access latency by overlapping the embedding access with communication. Our analysis of proactive embedding demonstrates that it can improve overall training performance by 46%.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142183929","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Octopus: A Cycle-Accurate Cache System Simulator 章鱼:周期精确的高速缓存系统模拟器
IF 1.4 3区 计算机科学
IEEE Computer Architecture Letters Pub Date : 2024-08-12 DOI: 10.1109/LCA.2024.3441941
Mohamed Hossam;Salah Hessien;Mohamed Hassan
{"title":"Octopus: A Cycle-Accurate Cache System Simulator","authors":"Mohamed Hossam;Salah Hessien;Mohamed Hassan","doi":"10.1109/LCA.2024.3441941","DOIUrl":"10.1109/LCA.2024.3441941","url":null,"abstract":"This paper introduces Octopus\u0000<sup>1</sup>\u0000, an open-source cycle-accurate cache system simulator with flexible interconnect models. Octopus meticulously simulates various cache system and interconnect components, including controllers, data arrays, coherence protocols, and arbiters. Being cycle-accurate enables Octopus to precisely model the behavior of target systems, while monitoring every memory request cycle by cycle. The design approach of Octopus distinguishes it from existing cache memory simulators, as it does not enforce a fixed memory system architecture but instead offers flexibility in configuring component connections and parameters, enabling simulation of diverse memory architectures. Moreover, the simulator provides two dual modes of operation, standalone and full-system simulation, which attains the best of both worlds benefits: fast simulations and high accuracy.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142183931","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Cycle-Oriented Dynamic Approximation: Architectural Framework to Meet Performance Requirements 面向周期的动态逼近:满足性能要求的架构框架
IF 1.4 3区 计算机科学
IEEE Computer Architecture Letters Pub Date : 2024-08-06 DOI: 10.1109/LCA.2024.3439318
Yuya Degawa;Shota Suzuki;Junichiro Kadomoto;Hidetsugu Irie;Shuichi Sakai
{"title":"Cycle-Oriented Dynamic Approximation: Architectural Framework to Meet Performance Requirements","authors":"Yuya Degawa;Shota Suzuki;Junichiro Kadomoto;Hidetsugu Irie;Shuichi Sakai","doi":"10.1109/LCA.2024.3439318","DOIUrl":"10.1109/LCA.2024.3439318","url":null,"abstract":"Approximate computing achieves shorter execution times and reduced energy consumption in areas where precise computation written in a program is not essential to meet a goal. When applying the approximations, it is vital to satisfy the required quality-of-service (QoS) (execution time) and quality-of-results (QoR) (output accuracy). Existing methods have difficulty in maintaining a constant QoS or impose a burden on programmers. In this study, we propose the Cycle-oriented Dynamic Approximation (CODAX) algorithms and processor architecture that minimize the burden on the programmer and maintain the execution time close to the required QoS while providing the user with an option to satisfy their QoR requirement. CODAX operates based on a threshold that indicates the maximum number of cycles available for one loop iteration. The threshold automatically increases or decreases at runtime to bring the total number of elapsed cycles close to the required QoS. Furthermore, CODAX allows the user to change the threshold to indirectly guarantee the required QoR. Our simulation revealed that CODAX brought the actual number of executed cycles close to the expected number for four workloads.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141938366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
LTE: Lightweight and Time-Efficient Hardware Encoder for Post-Quantum Scheme HQC LTE:用于后量子方案 HQC 的轻量级省时硬件编码器
IF 1.4 3区 计算机科学
IEEE Computer Architecture Letters Pub Date : 2024-07-30 DOI: 10.1109/LCA.2024.3435495
Yazheng Tu;Pengzhou He;Chip-Hong Chang;Jiafeng Xie
{"title":"LTE: Lightweight and Time-Efficient Hardware Encoder for Post-Quantum Scheme HQC","authors":"Yazheng Tu;Pengzhou He;Chip-Hong Chang;Jiafeng Xie","doi":"10.1109/LCA.2024.3435495","DOIUrl":"10.1109/LCA.2024.3435495","url":null,"abstract":"Post-quantum cryptography (PQC) has gained increasing attention across the hardware research community, especially after the National Institute of Standards and Technology (NIST) started the PQC standardization process. There are, however, very few hardware implementations reported for the Hamming Quasi-Cyclic (HQC), which is one of the NIST fourth-round PQC candidates. As encoding is an important step in code-based public key encryption scheme, this paper presents a \u0000<bold>L</b>\u0000ightweight and \u0000<bold>T</b>\u0000ime-\u0000<bold>E</b>\u0000fficient (LTE) hardware encoder for HQC. Our proposed design features a streamlined data flow setup to manage the iterative computations between the Reed-Solomon encoder and the Reed-Muller encoder, and a detailed analysis to obtain an optimized Galois field multiplier. The proposed LTE encoder is also implemented on an FPGA platform to demonstrate its area-time efficiency. Our evaluation shows that the proposed hardware implementation of HQC encoder outperforms the most recently reported state-of-the-art hardware implementation with 34.5%, 26.7%, and 35.2% reduction in area-delay product (ADP) for hqc-128, hqc-192, and hqc-256, respectively.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141870587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Architecting Compatible PIM Protocol for CPU-PIM Collaboration 为 CPU-PIM 协作构建兼容的 PIM 协议
IF 1.4 3区 计算机科学
IEEE Computer Architecture Letters Pub Date : 2024-07-24 DOI: 10.1109/LCA.2024.3432936
Seunghyuk Yu;Hyeonu Kim;Kyoungho Jeun;Sunyoung Hwang;Eojin Lee
{"title":"Architecting Compatible PIM Protocol for CPU-PIM Collaboration","authors":"Seunghyuk Yu;Hyeonu Kim;Kyoungho Jeun;Sunyoung Hwang;Eojin Lee","doi":"10.1109/LCA.2024.3432936","DOIUrl":"10.1109/LCA.2024.3432936","url":null,"abstract":"Processing in Memory (PIM) technology is gaining traction with the introduction of several prototype products. However, the interfaces of existing PIM devices hinder CPU performance excessively by delaying normal memory requests for long periods during PIM operations. In this paper, we propose a new PIM command and protocol designed for compatibility across various PIM devices and host processors, focusing on DRAM standards with limited command space. Our proposed command, PIM-ACT, activates multiple banks simultaneously with assigning the specific PIM operation. It closely follows the functionality of the ACT command for straightforward control by the memory controller. We also explore memory scheduling policies that balance the latency of conventional memory requests with the throughput of PIM workloads. Our evaluation demonstrates the effectiveness of our approach in optimizing both PIM and conventional workload performance.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2024-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141778103","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
A Quantitative Analysis of State Space Model-Based Large Language Model: Study of Hungry Hungry Hippos 基于状态空间模型的大型语言模型定量分析:饥饿的河马》研究
IF 1.4 3区 计算机科学
IEEE Computer Architecture Letters Pub Date : 2024-07-03 DOI: 10.1109/LCA.2024.3422492
Dongho Yoon;Taehun Kim;Jae W. Lee;Minsoo Rhu
{"title":"A Quantitative Analysis of State Space Model-Based Large Language Model: Study of Hungry Hungry Hippos","authors":"Dongho Yoon;Taehun Kim;Jae W. Lee;Minsoo Rhu","doi":"10.1109/LCA.2024.3422492","DOIUrl":"10.1109/LCA.2024.3422492","url":null,"abstract":"As the need for processing long contexts in large language models (LLMs) increases, attention-based LLMs face significant challenges due to their high computation and memory requirements. To overcome this challenge, there have been several recent works that seek to alleviate attention's system-level bottlenecks. An approach that has been receiving a lot of attraction lately is state space models (SSMs) thanks to their ability to substantially reduce computational complexity and memory footprint. Despite the excitement around SSMs, there is a lack of an in-depth characterization and analysis on this important model architecture. In this paper, we delve into a representative SSM named Hungry Hungry Hippos (H3), examining its advantages as well as its current limitations. We also discuss future research directions on improving the efficiency of SSMs via hardware architectural support.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2024-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141547667","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Empirical Architectural Analysis on Performance Scalability of Petascale All-Flash Storage Systems 有关 Petascale 全闪存存储系统性能可扩展性的经验架构分析
IF 1.4 3区 计算机科学
IEEE Computer Architecture Letters Pub Date : 2024-06-25 DOI: 10.1109/LCA.2024.3418874
Mohammadamin Ajdari;Behrang Montazerzohour;Kimia Abdi;Hossein Asadi
{"title":"Empirical Architectural Analysis on Performance Scalability of Petascale All-Flash Storage Systems","authors":"Mohammadamin Ajdari;Behrang Montazerzohour;Kimia Abdi;Hossein Asadi","doi":"10.1109/LCA.2024.3418874","DOIUrl":"10.1109/LCA.2024.3418874","url":null,"abstract":"In this paper, we \u0000<italic>first</i>\u0000 analyze a real storage system consisting of 72 SSDs utilizing either \u0000<italic>Hardware RAID</i>\u0000 (HW-RAID) or \u0000<italic>Software RAID</i>\u0000 (SW-RAID), and show that SW-RAID is up to 7× faster. We then reveal that with an increasing number of SSDs, the limited I/O parallelism in SAS controllers and multi-enclosure handshaking overheads cause a significant performance drop, minimizing the total \u0000<italic>I/O Per Second</i>\u0000 (IOPS) of a 144-SSD system to less than a single SSD. \u0000<italic>Second</i>\u0000, we disclose the most important architectural parameters that affect a large-scale storage system. \u0000<italic>Third</i>\u0000, we propose a framework that models a large-scale storage system and estimates the system IOPS and system resource usage for various architectures. We verify our framework against a real system and show its high accuracy. \u0000<italic>Lastly</i>\u0000, we analyze a use case of a 240-SSD system and reveal how our framework guides architects in storage system scaling.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2024-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Accelerating Programmable Bootstrapping Targeting Contemporary GPU Microarchitecture 加速以当代 GPU 微体系结构为目标的可编程引导
IF 1.4 3区 计算机科学
IEEE Computer Architecture Letters Pub Date : 2024-06-24 DOI: 10.1109/LCA.2024.3418448
Hyesung Ji;Sangpyo Kim;Jaewan Choi;Jung Ho Ahn
{"title":"Accelerating Programmable Bootstrapping Targeting Contemporary GPU Microarchitecture","authors":"Hyesung Ji;Sangpyo Kim;Jaewan Choi;Jung Ho Ahn","doi":"10.1109/LCA.2024.3418448","DOIUrl":"10.1109/LCA.2024.3418448","url":null,"abstract":"Fully homomorphic encryption (FHE) enables computation on encrypted data without privacy leakage, among which GSW-based schemes are notable for supporting the evaluation of arbitrary univariate functions using programmable bootstrapping (PBS). Despite their wide applicability, their computational complexity in a single PBS impedes widespread adoption. However, at the application level, there are enough number of independent PBSs to achieve high data-level parallelism, making them suitable for running on GPUs known for their high parallel computing capability. On contemporary GPUs, peak integer performance has steadily increased, and the sizes of L2 cache and shared memory have also grown rapidly since the Volta architecture. Prior attempts to accelerate PBS on GPUs have fallen short due to their outdated implementations that cannot leverage recent GPU advances. In this paper, we introduce a GPU implementation that supports the latest PBS algorithm and incorporates GPU-trend-aware optimizations. Our implementation achieves a 10.8× performance improvement over the state-of-the-art (SOTA) GPU implementations on RTX 4090 and even outperforms the SOTA ASIC implementation.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2024-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10570278","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141532506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信