{"title":"Quantum Assertion Scheme for Assuring Qudit Robustness","authors":"Navnil Choudhury;Chao Lu;Kanad Basu","doi":"10.1109/LCA.2024.3483840","DOIUrl":"https://doi.org/10.1109/LCA.2024.3483840","url":null,"abstract":"Noisy Intermediate-Scale Quantum (NISQ) computers are impeded by constraints such as limited qubit count and susceptibility to noise, hindering the progression towards fault-tolerant quantum computing for intricate and practical applications. To augment the computational capabilities of quantum computers, research is gravitating towards qudits featuring more than two energy levels. This paper presents the inaugural examination of the repercussions of errors in qudit circuits. Subsequently, we introduce an innovative qudit-based assertion framework aimed at automatically detecting and reporting errors and warnings during the quantum circuit design and compilation process. Our proposed framework, when subjected to evaluation on existing quantum computing platforms, can detect both new and existing bugs with up to 100% coverage of the bugs mentioned in this paper.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 2","pages":"247-250"},"PeriodicalIF":1.4,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142825859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ONNXim: A Fast, Cycle-Level Multi-Core NPU Simulator","authors":"Hyungkyu Ham;Wonhyuk Yang;Yunseon Shin;Okkyun Woo;Guseul Heo;Sangyeop Lee;Jongse Park;Gwangsun Kim","doi":"10.1109/LCA.2024.3484648","DOIUrl":"https://doi.org/10.1109/LCA.2024.3484648","url":null,"abstract":"As DNNs (Deep Neural Networks) demand increasingly higher compute and memory requirements, designing efficient and performant NPUs (Neural Processing Units) has become more important. However, existing architectural NPU simulators lack support for high-speed simulation, multi-core modeling, multi-tenant scenarios, detailed DRAM/NoC modeling, and/or different deep learning frameworks. To address these limitations, this work proposes \u0000<italic>ONNXim</i>\u0000, a fast cycle-level simulator for multi-core NPUs in DNN serving systems. For ease of simulation, it takes DNN models in the ONNX graph format generated from various deep learning frameworks. In addition, based on the observation that typical NPU cores process tensor tiles from SRAM with \u0000<italic>deterministic</i>\u0000 compute latency, we model computation accurately with an event-driven approach, avoiding the overhead of modeling cycle-level activities. ONNXim also preserves dependencies between compute and tile DMAs. Meanwhile, the DRAM and NoC are modeled in cycle-level to properly model contention among multiple cores that can execute different DNN models for multi-tenancy. 
Consequently, ONNXim is significantly faster than existing simulators (e.g., by up to 365× over Accel-sim) and enables various case studies, such as multi-tenant NPUs, that were previously impractical due to slow speed and/or lack of functionalities.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 2","pages":"219-222"},"PeriodicalIF":1.4,"publicationDate":"2024-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142736406","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Flexible Hybrid Interconnection Design for High-Performance and Energy-Efficient Chiplet-Based Systems","authors":"Md Tareq Mahmud;Ke Wang","doi":"10.1109/LCA.2024.3477253","DOIUrl":"https://doi.org/10.1109/LCA.2024.3477253","url":null,"abstract":"Chiplet-based multi-die integration has prevailed in modern computing system designs as it provides an agile solution for improving processing power with reduced manufacturing costs. In chiplet-based implementations, complete electronic systems are created by integrating individual hardware components through interconnection networks that consist of intra-chiplet network-on-chips (NoCs) and an inter-chiplet silicon interposer. Unfortunately, current interconnection designs have become the limiting factor in further scaling performance and energy efficiency. Specifically, inter-chiplet communication through silicon interposers is expensive due to the limited throughput. The existing wired Network-on-Chip (NoC) design is not good for multicast and broadcast communication because of limited bandwidth, high hop count and limited hardware resources leading to high overhead, latency and power consumption. On the other hand, wireless components might be helpful for multicast/broadcast communications, but they require high setup latency which cannot be used for one-to-one communication. In this paper, we propose a hybrid interconnection design for high-performance and low-power communications in chiplet-based systems. The proposed design consists of both wired and wireless interconnects that can adapt to diverse communication patterns and requirements. A dynamic control policy is proposed to maximize the performance and minimize power consumption by allocating all traffic to wireless or wired hardware components based on the communication patterns. 
Evaluation results show that the proposed hybrid design achieves 8% to 46% lower average end-to-end delay and 0.93 to 2.7× energy saving over the existing designs with minimized overhead.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 2","pages":"215-218"},"PeriodicalIF":1.4,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142679284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GCStack: A GPU Cycle Accounting Mechanism for Providing Accurate Insight Into GPU Performance","authors":"Hanna Cha;Sungchul Lee;Yeonan Ha;Hanhwi Jang;Joonsung Kim;Youngsok Kim","doi":"10.1109/LCA.2024.3476909","DOIUrl":"https://doi.org/10.1109/LCA.2024.3476909","url":null,"abstract":"Cycles Per Instruction (CPI) stacks help computer architects gain insight into the performance of their target architectures and applications. To bring the benefits of CPI stacks to Graphics Processing Units (GPUs), prior studies have proposed GPU cycle accounting mechanisms that can identify the stall cycles and their stall events on GPU architectures. Unfortunately, the prior studies cannot provide accurate insight into the GPU performance due to their coarse-grained, priority-driven, and issue-centric cycle accounting mechanisms. In this letter, we present \u0000<italic>GCStack</i>\u0000, a fine-grained GPU cycle accounting mechanism that constructs accurate CPI stacks and accurately identifies primary GPU performance bottlenecks. GCStack first exposes all the stall events of the outstanding warps of a warp scheduler, most of which get hidden by the existing mechanisms. Then, GCStack defers the classification of structural stalls, which the existing mechanisms cannot correctly identify with their issue-stage-centric stall classification, to the later stages of the GPU pipeline. 
We implement GCStack on Accel-Sim and show that GCStack provides more accurate CPI stacks and GPU performance insight than GSI, the state-of-the-art GPU cycle accounting mechanism whose primary focus is on characterizing memory-related stalls.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 2","pages":"235-238"},"PeriodicalIF":1.4,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142761432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Characterization and Analysis of Text-to-Image Diffusion Models","authors":"Eunyeong Cho;Jehyeon Bang;Minsoo Rhu","doi":"10.1109/LCA.2024.3466118","DOIUrl":"https://doi.org/10.1109/LCA.2024.3466118","url":null,"abstract":"Diffusion models have rapidly emerged as a prominent AI model for image generation. Despite its importance, however, little have been understood within the computer architecture community regarding this emerging AI algorithm. We conduct a workload characterization on the inference process of diffusion models using Stable Diffusion. Our characterization uncovers several critical performance bottlenecks of diffusion models, the computational overhead of which gets aggravated as image size increases. We also discuss several performance optimization opportunities that leverage approximation and sparsity, which help alleviate diffusion model's computational complexity. These findings highlight the need for domain-specific hardware that reaps out the benefits of our proposal, paving the way for accelerated image generation.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 2","pages":"227-230"},"PeriodicalIF":1.4,"publicationDate":"2024-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142736405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Implementation of Knuth Yao Sampler on Reconfigurable Hardware","authors":"Paresh Baidya;Rourab Paul;Swagata Mandal;Sumit Kumar Debnath","doi":"10.1109/LCA.2024.3454490","DOIUrl":"10.1109/LCA.2024.3454490","url":null,"abstract":"Lattice-based cryptography offers a promising alternative to traditional cryptographic schemes due to its resistance against quantum attacks. Discrete Gaussian sampling plays a crucial role in lattice-based cryptographic algorithms such as Ring Learning with error (R-LWE) for generating the coefficient of the polynomials. The Knuth Yao Sampler is a widely used discrete Gaussian sampling technique in Lattice-based cryptography. On the other hand, Lattice based cryptography involves resource intensive complex computation. Due to the presence of inherent parallelism, on field programmability Field Programmable Gate Array (FPGA) based reconfigurable hardware can be a good platform for the implementation of Lattice-based cryptographic algorithms. In this work, an efficient implementation of Knuth Yao Sampler on reconfigurable hardware is proposed that not only reduces the resource utilization but also enhances the speed of the sampling operation. The proposed method reduces look up table (LUT) requirement by almost 29% and enhances the speed by almost 17 times compared to the method proposed by the authors in (Sinha Roy et al., 2014).","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 2","pages":"195-198"},"PeriodicalIF":1.4,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142183928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SmartQuant: CXL-Based AI Model Store in Support of Runtime Configurable Weight Quantization","authors":"Rui Xie;Asad Ul Haq;Linsen Ma;Krystal Sun;Sanchari Sen;Swagath Venkataramani;Liu Liu;Tong Zhang","doi":"10.1109/LCA.2024.3452699","DOIUrl":"10.1109/LCA.2024.3452699","url":null,"abstract":"Recent studies have revealed that, during the inference on generative AI models such as transformer, the importance of different weights exhibits substantial context-dependent variations. This naturally manifests a promising potential of adaptively configuring weight quantization to improve the generative AI inference efficiency. Although configurable weight quantization can readily leverage the hardware support of variable-precision arithmetics in modern GPU and AI accelerators, little prior research has studied how one could exploit variable weight quantization to proportionally improve the AI model memory access speed and energy efficiency. Motivated by the rapidly maturing CXL ecosystem, this work develops a CXL-based design solution to fill this gap. The key is to allow CXL memory controllers play an active role in supporting and exploiting runtime configurable weight quantization. Using transformer as a representative generative AI model, we carried out experiments that well demonstrate the effectiveness of the proposed design solution.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 2","pages":"199-202"},"PeriodicalIF":1.4,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142183930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proactive Embedding on Cold Data for Deep Learning Recommendation Model Training","authors":"Haeyoon Cho;Hyojun Son;Jungmin Choi;Byungil Koh;Minho Ha;John Kim","doi":"10.1109/LCA.2024.3445948","DOIUrl":"10.1109/LCA.2024.3445948","url":null,"abstract":"Deep learning recommendation model (DLRM) is an important class of deep learning networks that are commonly used in many applications. DRLM presents unique challenges, especially for scale-out training since it not only has compute and memory-intensive components but the communication between the multiple GPUs is also on the critical path. In this work, we propose how \u0000<italic>cold</i>\u0000 data in DLRM embedding tables can be exploited to propose proactive embedding. In particular, proactive embedding allows embedding table accesses to be done in advance to reduce the impact of the memory access latency by overlapping the embedding access with communication. Our analysis of proactive embedding demonstrates that it can improve overall training performance by 46%.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 2","pages":"203-206"},"PeriodicalIF":1.4,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142183929","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Octopus: A Cycle-Accurate Cache System Simulator","authors":"Mohamed Hossam;Salah Hessien;Mohamed Hassan","doi":"10.1109/LCA.2024.3441941","DOIUrl":"10.1109/LCA.2024.3441941","url":null,"abstract":"This paper introduces Octopus\u0000<sup>1</sup>\u0000, an open-source cycle-accurate cache system simulator with flexible interconnect models. Octopus meticulously simulates various cache system and interconnect components, including controllers, data arrays, coherence protocols, and arbiters. Being cycle-accurate enables Octopus to precisely model the behavior of target systems, while monitoring every memory request cycle by cycle. The design approach of Octopus distinguishes it from existing cache memory simulators, as it does not enforce a fixed memory system architecture but instead offers flexibility in configuring component connections and parameters, enabling simulation of diverse memory architectures. Moreover, the simulator provides two dual modes of operation, standalone and full-system simulation, which attains the best of both worlds benefits: fast simulations and high accuracy.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 2","pages":"191-194"},"PeriodicalIF":1.4,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142183931","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cycle-Oriented Dynamic Approximation: Architectural Framework to Meet Performance Requirements","authors":"Yuya Degawa;Shota Suzuki;Junichiro Kadomoto;Hidetsugu Irie;Shuichi Sakai","doi":"10.1109/LCA.2024.3439318","DOIUrl":"10.1109/LCA.2024.3439318","url":null,"abstract":"Approximate computing achieves shorter execution times and reduced energy consumption in areas where precise computation written in a program is not essential to meet a goal. When applying the approximations, it is vital to satisfy the required quality-of-service (QoS) (execution time) and quality-of-results (QoR) (output accuracy). Existing methods have difficulty in maintaining a constant QoS or impose a burden on programmers. In this study, we propose the Cycle-oriented Dynamic Approximation (CODAX) algorithms and processor architecture that minimize the burden on the programmer and maintain the execution time close to the required QoS while providing the user with an option to satisfy their QoR requirement. CODAX operates based on a threshold that indicates the maximum number of cycles available for one loop iteration. The threshold automatically increases or decreases at runtime to bring the total number of elapsed cycles close to the required QoS. Furthermore, CODAX allows the user to change the threshold to indirectly guarantee the required QoR. 
Our simulation revealed that CODAX brought the actual number of executed cycles close to the expected number for four workloads.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 2","pages":"211-214"},"PeriodicalIF":1.4,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141938366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}