IEEE Computer Architecture Letters: Latest Articles, Page 8

SCALES: SCALable and Area-Efficient Systolic Accelerator for Ternary Polynomial Multiplication

IF 1.4, Zone 3 (Computer Science)

IEEE Computer Architecture Letters Pub Date : 2024-11-25 DOI: 10.1109/LCA.2024.3505872

Samuel Coulon;Tianyou Bao;Jiafeng Xie

Vol. 23, No. 2, pp. 243-246 (Journal Article)

Abstract: Polynomial multiplication is a key component in many post-quantum cryptography and homomorphic encryption schemes. One recurring variation, ternary polynomial multiplication over the ring $\mathbb{Z}_{q}/(x^{n}+1)$, where one input polynomial has ternary coefficients {−1, 0, 1} and the other has large integer coefficients {0, $q-1$}, has recently drawn significant attention from various communities. Following this trend, this paper presents a novel SCALable and area-Efficient Systolic (SCALES) accelerator for ternary polynomial multiplication. In total, we have carried out three layers of coherent interdependent efforts. First, we have rigorously derived a novel block-processing strategy and algorithm based on the schoolbook method for polynomial multiplication. Then, we have innovatively implemented the proposed algorithm as the SCALES accelerator with the help of a number of field-programmable gate array (FPGA)-oriented optimization techniques. Lastly, we have conducted a thorough implementation analysis to showcase the efficiency of the proposed accelerator. The comparison demonstrated that the SCALES accelerator has at least 19.0% and 23.8% less equivalent area-time product (eATP) than the state-of-the-art designs. We hope this work can stimulate continued research in the field.
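
For readers unfamiliar with the operation the SCALES abstract refers to, the following is a minimal software sketch of plain schoolbook multiplication in the ring with modulus $x^{n}+1$, where one operand is ternary. It is not the paper's block-processing algorithm or its systolic architecture, which the abstract does not detail. The function name `ternary_negacyclic_mul` is hypothetical, and the sketch assumes the non-ternary coefficients lie in the range 0 to q−1.

```python
def ternary_negacyclic_mul(a, s, q):
    """Schoolbook product of a and s modulo (x^n + 1) and modulo q.

    a: list of n integer coefficients, assumed to lie in [0, q-1]
    s: list of n ternary coefficients, each -1, 0, or 1
    (Hypothetical helper for illustration only.)
    """
    n = len(a)
    if len(s) != n:
        raise ValueError("both polynomials must have n coefficients")
    c = [0] * n
    for j, sj in enumerate(s):
        if sj == 0:
            continue  # a zero coefficient contributes nothing
        for i, ai in enumerate(a):
            # A ternary coefficient turns each multiply into add or subtract.
            term = ai if sj == 1 else -ai
            k = i + j
            if k >= n:
                # Reduction by x^n + 1: x^n wraps around as -1.
                k -= n
                term = -term
            c[k] = (c[k] + term) % q
    return c


if __name__ == "__main__":
    # (1 + 2x + 3x^2 + 4x^3) * (1 - x^2) in Z_17[x]/(x^4 + 1)
    print(ternary_negacyclic_mul([1, 2, 3, 4], [1, 0, -1, 0], 17))
```

Because the ternary operand only selects between adding, subtracting, or skipping a coefficient, no general multipliers are needed, which is one reason this variant is attractive for area-constrained hardware.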

Citations: 0

A Case for Hardware Memoization in Server CPUs

IF 1.4, Zone 3 (Computer Science)

IEEE Computer Architecture Letters Pub Date : 2024-11-22 DOI: 10.1109/LCA.2024.3505075

Farid Samandi;Natheesan Ratnasegar;Michael Ferdman

Vol. 23, No. 2, pp. 231-234 (Journal Article)

Abstract: Server applications exhibit a high degree of code repetition because they handle many similar requests. In turn, repeated execution of the same code, often with identical inputs, highlights an inefficiency in the execution of server software and suggests memoization as a way to improve performance. Memoization has been extensively explored in software, and several hardware and hardware-assisted memoization schemes have been proposed in the literature. However, these works targeted memoization of mathematical or algorithmic processing, whereas server applications call for a different approach. We observe that the opportunity for memoization in servers arises not from eliminating the repetition of complex computation, but from eliminating the repetition of software orchestration code. This work studies hardware memoization in servers, ultimately focusing on one pattern: instruction sequences starting with indirect jumps. We explore how an out-of-order pipeline can be extended to support memoization of these instruction sequences, demonstrating the potential of hardware memoization for servers. Using 26 applications to make our case (3 CloudSuite workloads and 23 vSwarm serverless functions), we show how targeting just this one pattern of instruction sequences can memoize over 10% (up to 15.6%) of the dynamically executed instructions in these server applications.

Citations: 0

Characterization and Analysis of the 3D Gaussian Splatting Rendering Pipeline

IF 1.4, Zone 3 (Computer Science)

IEEE Computer Architecture Letters Pub Date : 2024-11-21 DOI: 10.1109/LCA.2024.3504579

Jiwon Lee;Yunjae Lee;Youngeun Kwon;Minsoo Rhu

Citations: 0

SPGPU: Spatially Programmed GPU

IF 1.4, Zone 3 (Computer Science)

IEEE Computer Architecture Letters Pub Date : 2024-11-14 DOI: 10.1109/LCA.2024.3499339

Shizhuo Zhu;Illia Shkirko;Jacob Levinson;Zhengrong Wang;Tony Nowatzki

Citations: 0

Quantum Assertion Scheme for Assuring Qudit Robustness

IF 1.4, Zone 3 (Computer Science)

IEEE Computer Architecture Letters Pub Date : 2024-11-04 DOI: 10.1109/LCA.2024.3483840

Navnil Choudhury;Chao Lu;Kanad Basu

Citations: 0

ONNXim: A Fast, Cycle-Level Multi-Core NPU Simulator

IF 1.4, Zone 3 (Computer Science)

IEEE Computer Architecture Letters Pub Date : 2024-10-22 DOI: 10.1109/LCA.2024.3484648

Hyungkyu Ham;Wonhyuk Yang;Yunseon Shin;Okkyun Woo;Guseul Heo;Sangyeop Lee;Jongse Park;Gwangsun Kim

Vol. 23, No. 2, pp. 219-222 (Journal Article)

Abstract: As DNNs (Deep Neural Networks) demand increasingly higher compute and memory requirements, designing efficient and performant NPUs (Neural Processing Units) has become more important. However, existing architectural NPU simulators lack support for high-speed simulation, multi-core modeling, multi-tenant scenarios, detailed DRAM/NoC modeling, and/or different deep learning frameworks. To address these limitations, this work proposes ONNXim, a fast cycle-level simulator for multi-core NPUs in DNN serving systems. For ease of simulation, it takes DNN models in the ONNX graph format generated from various deep learning frameworks. In addition, based on the observation that typical NPU cores process tensor tiles from SRAM with deterministic compute latency, we model computation accurately with an event-driven approach, avoiding the overhead of modeling cycle-level activities. ONNXim also preserves dependencies between compute and tile DMAs. Meanwhile, the DRAM and NoC are modeled at cycle level to properly model contention among multiple cores that can execute different DNN models for multi-tenancy. Consequently, ONNXim is significantly faster than existing simulators (e.g., by up to 365× over Accel-sim) and enables various case studies, such as multi-tenant NPUs, that were previously impractical due to slow speed and/or lack of functionalities.
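
The event-driven idea in the ONNXim abstract can be illustrated with a toy model: when each tile's compute latency is known in advance, the simulator can jump straight to the next completion event instead of ticking every cycle. The sketch below is not ONNXim's code. The function `simulate_core`, the single serial DMA engine, the fixed DMA latencies, and the unlimited SRAM buffering are all assumptions made for illustration (the abstract states that DRAM and NoC are in fact modeled at cycle level).

```python
import heapq


def simulate_core(tiles):
    """Return the cycle at which the last tile finishes computing.

    tiles: list of (dma_cycles, compute_cycles) pairs, processed in order.
    Assumptions (illustrative only): one DMA engine that loads tiles
    back to back, one compute unit, and unlimited SRAM for loaded tiles.
    """
    events = []      # min-heap of (cycle, insertion_order, kind, tile_index)
    order = 0

    def push(cycle, kind, idx):
        nonlocal order
        heapq.heappush(events, (cycle, order, kind, idx))
        order += 1

    loaded = set()       # tiles whose DMA has completed
    next_compute = 0     # index of the next tile to compute
    core_busy = False
    now = 0
    if tiles:
        push(tiles[0][0], "dma_done", 0)

    while events:
        # Jump directly to the next event; no per-cycle stepping.
        now, _, kind, idx = heapq.heappop(events)
        if kind == "dma_done":
            loaded.add(idx)
            if idx + 1 < len(tiles):
                push(now + tiles[idx + 1][0], "dma_done", idx + 1)
        else:  # "compute_done"
            core_busy = False
            next_compute = idx + 1
        # Compute on a tile may start only after its DMA has finished,
        # which preserves the compute/DMA dependency.
        if not core_busy and next_compute in loaded:
            core_busy = True
            push(now + tiles[next_compute][1], "compute_done", next_compute)
    return now


if __name__ == "__main__":
    print(simulate_core([(10, 30), (10, 30)]))
```

The number of loop iterations here grows with the number of tiles rather than the number of simulated cycles, which is the source of the speedup such an approach aims for.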

Citations: 0

A Flexible Hybrid Interconnection Design for High-Performance and Energy-Efficient Chiplet-Based Systems

IF 1.4, Zone 3 (Computer Science)

IEEE Computer Architecture Letters Pub Date : 2024-10-09 DOI: 10.1109/LCA.2024.3477253

Md Tareq Mahmud;Ke Wang

Vol. 23, No. 2, pp. 215-218 (Journal Article)

Abstract: Chiplet-based multi-die integration has prevailed in modern computing system designs as it provides an agile solution for improving processing power with reduced manufacturing costs. In chiplet-based implementations, complete electronic systems are created by integrating individual hardware components through interconnection networks that consist of intra-chiplet networks-on-chip (NoCs) and an inter-chiplet silicon interposer. Unfortunately, current interconnection designs have become the limiting factor in further scaling performance and energy efficiency. Specifically, inter-chiplet communication through silicon interposers is expensive due to the limited throughput. The existing wired NoC design is poorly suited to multicast and broadcast communication because limited bandwidth, high hop counts, and limited hardware resources lead to high overhead, latency, and power consumption. On the other hand, wireless components might be helpful for multicast/broadcast communications, but they incur a high setup latency that makes them unsuitable for one-to-one communication. In this paper, we propose a hybrid interconnection design for high-performance and low-power communications in chiplet-based systems. The proposed design consists of both wired and wireless interconnects that can adapt to diverse communication patterns and requirements. A dynamic control policy is proposed to maximize the performance and minimize power consumption by allocating all traffic to wireless or wired hardware components based on the communication patterns. Evaluation results show that the proposed hybrid design achieves 8% to 46% lower average end-to-end delay and 0.93 to 2.7× energy saving over the existing designs with minimized overhead.

Citations: 0

GCStack: A GPU Cycle Accounting Mechanism for Providing Accurate Insight Into GPU Performance

IF 1.4, Zone 3 (Computer Science)

IEEE Computer Architecture Letters Pub Date : 2024-10-09 DOI: 10.1109/LCA.2024.3476909

Hanna Cha;Sungchul Lee;Yeonan Ha;Hanhwi Jang;Joonsung Kim;Youngsok Kim

Vol. 23, No. 2, pp. 235-238 (Journal Article)

Abstract: Cycles Per Instruction (CPI) stacks help computer architects gain insight into the performance of their target architectures and applications. To bring the benefits of CPI stacks to Graphics Processing Units (GPUs), prior studies have proposed GPU cycle accounting mechanisms that can identify the stall cycles and their stall events on GPU architectures. Unfortunately, the prior studies cannot provide accurate insight into the GPU performance due to their coarse-grained, priority-driven, and issue-centric cycle accounting mechanisms. In this letter, we present GCStack, a fine-grained GPU cycle accounting mechanism that constructs accurate CPI stacks and accurately identifies primary GPU performance bottlenecks. GCStack first exposes all the stall events of the outstanding warps of a warp scheduler, most of which get hidden by the existing mechanisms. Then, GCStack defers the classification of structural stalls, which the existing mechanisms cannot correctly identify with their issue-stage-centric stall classification, to the later stages of the GPU pipeline. We implement GCStack on Accel-Sim and show that GCStack provides more accurate CPI stacks and GPU performance insight than GSI, the state-of-the-art GPU cycle accounting mechanism whose primary focus is on characterizing memory-related stalls.
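
To make the contrast in the GCStack abstract concrete, the toy sketch below compares a priority-driven attribution, which charges each stalled cycle to a single "winning" stall reason, with an attribution that exposes every outstanding warp's stall reason. It is not GCStack's mechanism: the equal split of a cycle across warps, the function names, and the stall-reason labels are all assumptions chosen for illustration.

```python
from collections import Counter


def priority_attribution(stalled_cycles, priority):
    """Charge each stalled cycle entirely to its highest-priority reason.

    stalled_cycles: one list per stalled cycle holding the stall reason
    of each outstanding warp; priority: reasons, highest priority first.
    """
    totals = Counter()
    for reasons in stalled_cycles:
        totals[min(reasons, key=priority.index)] += 1
    return dict(totals)


def even_attribution(stalled_cycles):
    """Split each stalled cycle equally across all warps' stall reasons.

    The equal split is an assumption of this sketch, not a documented rule.
    """
    totals = Counter()
    for reasons in stalled_cycles:
        share = 1 / len(reasons)
        for reason in reasons:
            totals[reason] += share
    return dict(totals)


def cpi_components(cycles_by_reason, instructions):
    """Convert per-reason cycle counts into CPI stack components."""
    return {r: c / instructions for r, c in cycles_by_reason.items()}


if __name__ == "__main__":
    cycles = [["memory", "sync"], ["memory", "memory"], ["sync"]]
    print(priority_attribution(cycles, ["memory", "sync"]))
    print(even_attribution(cycles))
```

In this toy trace, the priority-driven scheme hides the synchronization stall in the first cycle behind the memory stall, whereas the evenly split scheme keeps both visible in the resulting stack.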

Citations: 0

Characterization and Analysis of Text-to-Image Diffusion Models

IF 1.4, Zone 3 (Computer Science)

IEEE Computer Architecture Letters Pub Date : 2024-09-26 DOI: 10.1109/LCA.2024.3466118

Eunyeong Cho;Jehyeon Bang;Minsoo Rhu

Citations: 0

Efficient Implementation of Knuth Yao Sampler on Reconfigurable Hardware

IF 1.4, Zone 3 (Computer Science)

IEEE Computer Architecture Letters Pub Date : 2024-09-03 DOI: 10.1109/LCA.2024.3454490

Paresh Baidya;Rourab Paul;Swagata Mandal;Sumit Kumar Debnath

Vol. 23, No. 2, pp. 195-198 (Journal Article)

Abstract: Lattice-based cryptography offers a promising alternative to traditional cryptographic schemes due to its resistance against quantum attacks. Discrete Gaussian sampling plays a crucial role in lattice-based cryptographic algorithms such as Ring Learning with Errors (R-LWE) for generating the coefficients of the polynomials. The Knuth Yao Sampler is a widely used discrete Gaussian sampling technique in lattice-based cryptography. On the other hand, lattice-based cryptography involves resource-intensive, complex computation. Owing to its inherent parallelism and in-field programmability, Field Programmable Gate Array (FPGA)-based reconfigurable hardware can be a good platform for the implementation of lattice-based cryptographic algorithms. In this work, an efficient implementation of the Knuth Yao Sampler on reconfigurable hardware is proposed that not only reduces the resource utilization but also enhances the speed of the sampling operation. The proposed method reduces the lookup table (LUT) requirement by almost 29% and enhances the speed by almost 17 times compared to the method proposed by the authors in (Sinha Roy et al., 2014).
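
As background for the sampler named in this abstract, here is a minimal software sketch of the textbook column-scanning form of Knuth-Yao sampling. It walks a probability matrix one column per random bit and tracks a distance counter `d`. It is not the hardware design proposed in the paper, which the abstract does not describe. The function name, the tiny three-outcome matrix, and the restart on an unresolved walk (relevant only when probabilities are truncated) are assumptions for illustration.

```python
import random


def knuth_yao_sample(pmat, rand_bit):
    """Draw one sample index using a column-wise Knuth-Yao walk.

    pmat[row][col] is bit `col` (most significant first) of the binary
    expansion of the probability of outcome `row`.
    rand_bit is a callable returning 0 or 1.
    """
    n_rows = len(pmat)
    n_cols = len(pmat[0])
    while True:  # assumption: restart if truncated bits leave no outcome
        d = 0
        for col in range(n_cols):
            d = 2 * d + rand_bit()  # descend one level of the tree
            for row in range(n_rows - 1, -1, -1):
                d -= pmat[row][col]
                if d == -1:  # the walk landed on a terminal node
                    return row


# Toy distribution: P(0) = 1/2, P(1) = 1/4, P(2) = 1/4.
PMAT = [
    [1, 0],  # 0.10 in binary
    [0, 1],  # 0.01 in binary
    [0, 1],  # 0.01 in binary
]

if __name__ == "__main__":
    counts = [0, 0, 0]
    for _ in range(10_000):
        counts[knuth_yao_sample(PMAT, lambda: random.getrandbits(1))] += 1
    print(counts)
```

For a discrete Gaussian, the matrix rows would hold the truncated binary expansions of the Gaussian probabilities; the walk itself is unchanged, and each sample consumes only as many random bits as the walk needs.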

Citations: 0