IEEE Computer Architecture Letters: Latest Articles (Page 6)

GCStack: A GPU Cycle Accounting Mechanism for Providing Accurate Insight Into GPU Performance

IF 1.4, Zone 3 (Computer Science)

IEEE Computer Architecture Letters Pub Date : 2024-10-09 DOI: 10.1109/LCA.2024.3476909

Hanna Cha;Sungchul Lee;Yeonan Ha;Hanhwi Jang;Joonsung Kim;Youngsok Kim

Abstract: Cycles Per Instruction (CPI) stacks help computer architects gain insight into the performance of their target architectures and applications. To bring the benefits of CPI stacks to Graphics Processing Units (GPUs), prior studies have proposed GPU cycle accounting mechanisms that can identify the stall cycles and their stall events on GPU architectures. Unfortunately, the prior studies cannot provide accurate insight into the GPU performance due to their coarse-grained, priority-driven, and issue-centric cycle accounting mechanisms. In this letter, we present GCStack, a fine-grained GPU cycle accounting mechanism that constructs accurate CPI stacks and accurately identifies primary GPU performance bottlenecks. GCStack first exposes all the stall events of the outstanding warps of a warp scheduler, most of which get hidden by the existing mechanisms. Then, GCStack defers the classification of structural stalls, which the existing mechanisms cannot correctly identify with their issue-stage-centric stall classification, to the later stages of the GPU pipeline. We implement GCStack on Accel-Sim and show that GCStack provides more accurate CPI stacks and GPU performance insight than GSI, the state-of-the-art GPU cycle accounting mechanism whose primary focus is on characterizing memory-related stalls.

Volume 23, Issue 2, pp. 235-238

Citations: 0

Characterization and Analysis of Text-to-Image Diffusion Models

IF 1.4, Zone 3 (Computer Science)

IEEE Computer Architecture Letters Pub Date : 2024-09-26 DOI: 10.1109/LCA.2024.3466118

Eunyeong Cho;Jehyeon Bang;Minsoo Rhu

Citations: 0

Efficient Implementation of Knuth Yao Sampler on Reconfigurable Hardware

IF 1.4, Zone 3 (Computer Science)

IEEE Computer Architecture Letters Pub Date : 2024-09-03 DOI: 10.1109/LCA.2024.3454490

Paresh Baidya;Rourab Paul;Swagata Mandal;Sumit Kumar Debnath

Abstract: Lattice-based cryptography offers a promising alternative to traditional cryptographic schemes due to its resistance against quantum attacks. Discrete Gaussian sampling plays a crucial role in lattice-based cryptographic algorithms such as Ring Learning With Errors (R-LWE), where it generates the coefficients of the polynomials. The Knuth Yao Sampler is a widely used discrete Gaussian sampling technique in lattice-based cryptography. However, lattice-based cryptography involves resource-intensive, complex computation. Because of its inherent parallelism and in-field programmability, Field Programmable Gate Array (FPGA) based reconfigurable hardware can be a good platform for implementing lattice-based cryptographic algorithms. In this work, an efficient implementation of the Knuth Yao Sampler on reconfigurable hardware is proposed that not only reduces resource utilization but also enhances the speed of the sampling operation. The proposed method reduces the look-up table (LUT) requirement by almost 29% and enhances the speed by almost 17 times compared to the method proposed by the authors in (Sinha Roy et al., 2014).

Volume 23, Issue 2, pp. 195-198
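For readers unfamiliar with the Knuth Yao Sampler named in the abstract above, the following is a minimal software sketch of the textbook column-scanning random walk, not the hardware design proposed in the letter. The function name `knuth_yao_sample`, the toy probability matrix `PMAT`, and the `random_bits` helper are all hypothetical and chosen only for illustration; `PMAT[row][col]` is assumed to hold bit `col` of the binary expansion of the probability of outcome `row`.

```python
import random
from typing import Iterator, List


def knuth_yao_sample(pmat: List[List[int]], bits: Iterator[int]) -> int:
    """Return one sample by walking the tree implied by pmat.

    One random bit is consumed per column (tree level). The counter d
    tracks the position among the nodes of the current level; scanning
    a column subtracts one for every terminal node, and the walk stops
    at the row where d drops to -1.
    """
    d = 0
    for col in range(len(pmat[0])):
        d = 2 * d + next(bits)
        for row in range(len(pmat) - 1, -1, -1):
            d -= pmat[row][col]
            if d == -1:
                return row
    # Reached only if the stored probabilities do not sum to exactly 1.
    raise ValueError("probability matrix precision exhausted")


def random_bits() -> Iterator[int]:
    """Hypothetical bit source; a real design would use a hardware RNG."""
    while True:
        yield random.getrandbits(1)


# Toy distribution (an assumption for the example, not a Gaussian):
# P(0) = 0.10b = 1/2, P(1) = 0.01b = 1/4, P(2) = 0.01b = 1/4.
PMAT = [
    [1, 0],
    [0, 1],
    [0, 1],
]

if __name__ == "__main__":
    print(knuth_yao_sample(PMAT, random_bits()))
```

Passing the bit source as an iterator keeps the walk deterministic under test: feeding a fixed bit sequence always yields the same sample.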

Citations: 0

SmartQuant: CXL-Based AI Model Store in Support of Runtime Configurable Weight Quantization

IF 1.4, Zone 3 (Computer Science)

IEEE Computer Architecture Letters Pub Date : 2024-09-02 DOI: 10.1109/LCA.2024.3452699

Rui Xie;Asad Ul Haq;Linsen Ma;Krystal Sun;Sanchari Sen;Swagath Venkataramani;Liu Liu;Tong Zhang

Citations: 0

Proactive Embedding on Cold Data for Deep Learning Recommendation Model Training

IF 1.4, Zone 3 (Computer Science)

IEEE Computer Architecture Letters Pub Date : 2024-08-28 DOI: 10.1109/LCA.2024.3445948

Haeyoon Cho;Hyojun Son;Jungmin Choi;Byungil Koh;Minho Ha;John Kim

Citations: 0

Octopus: A Cycle-Accurate Cache System Simulator

IF 1.4, Zone 3 (Computer Science)

IEEE Computer Architecture Letters Pub Date : 2024-08-12 DOI: 10.1109/LCA.2024.3441941

Mohamed Hossam;Salah Hessien;Mohamed Hassan

Citations: 0

Cycle-Oriented Dynamic Approximation: Architectural Framework to Meet Performance Requirements

IF 1.4, Zone 3 (Computer Science)

IEEE Computer Architecture Letters Pub Date : 2024-08-06 DOI: 10.1109/LCA.2024.3439318

Yuya Degawa;Shota Suzuki;Junichiro Kadomoto;Hidetsugu Irie;Shuichi Sakai

Abstract: Approximate computing achieves shorter execution times and reduced energy consumption in areas where precise computation written in a program is not essential to meet a goal. When applying the approximations, it is vital to satisfy the required quality-of-service (QoS) (execution time) and quality-of-results (QoR) (output accuracy). Existing methods have difficulty in maintaining a constant QoS or impose a burden on programmers. In this study, we propose the Cycle-oriented Dynamic Approximation (CODAX) algorithms and processor architecture that minimize the burden on the programmer and maintain the execution time close to the required QoS while providing the user with an option to satisfy their QoR requirement. CODAX operates based on a threshold that indicates the maximum number of cycles available for one loop iteration. The threshold automatically increases or decreases at runtime to bring the total number of elapsed cycles close to the required QoS. Furthermore, CODAX allows the user to change the threshold to indirectly guarantee the required QoR. Our simulation revealed that CODAX brought the actual number of executed cycles close to the expected number for four workloads.

Volume 23, Issue 2, pp. 211-214
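The per-iteration cycle threshold described in the CODAX abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the CODAX algorithm itself: the update rule (remaining cycle budget divided evenly over the remaining iterations) and the names `run_with_cycle_budget`, `iteration_costs`, and `target_cycles` are hypothetical, and an iteration that would exceed the threshold is assumed to be cut short by approximation at exactly the threshold.

```python
from typing import List, Tuple


def run_with_cycle_budget(
    iteration_costs: List[int], target_cycles: int
) -> Tuple[int, List[int]]:
    """Model a loop whose iterations are capped by an adaptive threshold.

    iteration_costs[i] is the cycle count iteration i would need if run
    precisely. Before each iteration the threshold is recomputed from
    the remaining budget, so it rises after cheap iterations and falls
    after expensive ones.
    """
    elapsed = 0
    thresholds = []
    total = len(iteration_costs)
    for i, cost in enumerate(iteration_costs):
        remaining_iterations = total - i
        # Assumed rule: share the remaining budget evenly (at least 1 cycle).
        threshold = max(1, (target_cycles - elapsed) // remaining_iterations)
        thresholds.append(threshold)
        # Approximation truncates the iteration once the threshold is hit.
        elapsed += min(cost, threshold)
    return elapsed, thresholds


if __name__ == "__main__":
    elapsed, thresholds = run_with_cycle_budget([10, 30, 10, 30], 60)
    print(elapsed, thresholds)
```

In this model the threshold grows whenever an iteration finishes under budget, which is one simple way to keep the total elapsed cycles close to the target without programmer intervention.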

Citations: 0

LTE: Lightweight and Time-Efficient Hardware Encoder for Post-Quantum Scheme HQC

IF 1.4, Zone 3 (Computer Science)

IEEE Computer Architecture Letters Pub Date : 2024-07-30 DOI: 10.1109/LCA.2024.3435495

Yazheng Tu;Pengzhou He;Chip-Hong Chang;Jiafeng Xie

Abstract: Post-quantum cryptography (PQC) has gained increasing attention across the hardware research community, especially after the National Institute of Standards and Technology (NIST) started the PQC standardization process. There are, however, very few hardware implementations reported for Hamming Quasi-Cyclic (HQC), which is one of the NIST fourth-round PQC candidates. As encoding is an important step in code-based public-key encryption schemes, this paper presents a Lightweight and Time-Efficient (LTE) hardware encoder for HQC. Our proposed design features a streamlined data flow setup to manage the iterative computations between the Reed-Solomon encoder and the Reed-Muller encoder, and a detailed analysis to obtain an optimized Galois field multiplier. The proposed LTE encoder is also implemented on an FPGA platform to demonstrate its area-time efficiency. Our evaluation shows that the proposed hardware implementation of the HQC encoder outperforms the most recently reported state-of-the-art hardware implementation with 34.5%, 26.7%, and 35.2% reduction in area-delay product (ADP) for hqc-128, hqc-192, and hqc-256, respectively.

Volume 23, Issue 2, pp. 187-190

Citations: 0

Architecting Compatible PIM Protocol for CPU-PIM Collaboration

IF 1.4, Zone 3 (Computer Science)

IEEE Computer Architecture Letters Pub Date : 2024-07-24 DOI: 10.1109/LCA.2024.3432936

Seunghyuk Yu;Hyeonu Kim;Kyoungho Jeun;Sunyoung Hwang;Eojin Lee

Citations: 0

A Quantitative Analysis of State Space Model-Based Large Language Model: Study of Hungry Hungry Hippos

IF 1.4, Zone 3 (Computer Science)

IEEE Computer Architecture Letters Pub Date : 2024-07-03 DOI: 10.1109/LCA.2024.3422492

Dongho Yoon;Taehun Kim;Jae W. Lee;Minsoo Rhu

Citations: 0