{"title":"Enhancing DNN Training Efficiency Via Dynamic Asymmetric Architecture","authors":"Samer Kurzum;Gil Shomron;Freddy Gabbay;Uri Weiser","doi":"10.1109/LCA.2023.3275909","DOIUrl":"10.1109/LCA.2023.3275909","url":null,"abstract":"Deep neural networks (DNNs) require abundant multiply-and-accumulate (MAC) operations. Thanks to DNNs’ ability to accommodate noise, some of the computational burden is commonly mitigated by quantization–that is, by using lower precision floating-point operations. Layer granularity is the preferred method, as it is easily mapped to commodity hardware. In this paper, we propose Dynamic Asymmetric Architecture (DAA), in which the micro-architecture decides what the precision of each MAC operation should be during runtime. We demonstrate a DAA with two data streams and a value-based controller that decides which data stream deserves the higher precision resource. We evaluate this mechanism in terms of accuracy on a number of convolutional neural networks (CNNs) and demonstrate its feasibility on top of a systolic array. Our experimental analysis shows that DAA potentially achieves 2x throughput improvement for ResNet-18 while saving 35% of the energy with less than 0.5% degradation in accuracy.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 1","pages":"49-52"},"PeriodicalIF":2.3,"publicationDate":"2023-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45178350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware-Implemented Lightweight Accelerator for Large Integer Polynomial Multiplication","authors":"Pengzhou He;Yazheng Tu;Çetin Kaya Koç;Jiafeng Xie","doi":"10.1109/LCA.2023.3274931","DOIUrl":"10.1109/LCA.2023.3274931","url":null,"abstract":"Large integer polynomial multiplication is frequently used as a key component in post-quantum cryptography (PQC) algorithms. Following the trend that efficient hardware implementation for PQC is emphasized, in this letter, we propose a new hardware-implemented lightweight accelerator for the large integer polynomial multiplication of Saber (one of the National Institute of Standards and Technology third-round finalists). First, we provided a derivation process to obtain the algorithm for the targeted polynomial multiplication. Then, the proposed algorithm is mapped into an optimized hardware accelerator. Finally, we demonstrated the efficiency of the proposed design, e.g., this accelerator with \u0000<inline-formula><tex-math>$v=32$</tex-math></inline-formula>\u0000 has at least 48.37% less area-delay product (ADP) than the existing designs. The outcome of this work is expected to provide useful references for efficient implementation of other PQC.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 1","pages":"57-60"},"PeriodicalIF":2.3,"publicationDate":"2023-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47266691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"In-Memory Versioning (IMV)","authors":"David Andrew Roberts;Haojie Ye;Tony Brewer;Sean Eilert","doi":"10.1109/LCA.2023.3273124","DOIUrl":"10.1109/LCA.2023.3273124","url":null,"abstract":"In this letter, we propose and evaluate designs for a novel hardware-assisted data versioning system (in-memory versioning or IMV) in the context of high-performance computing. Our main novelty and advantage over recent published work is that it does not require any changes to host processor logic, instead augmenting a memory controller within memory modules. It is faster and more efficient than existing high-performance computing (HPC) checkpointing schemes and works from hours to sub-second checkpoint intervals. The main premise is to perform most operations in hardware at cache-line granularity, avoiding operating system (OS) latency and page copying bandwidth overhead. Energy is saved by keeping data movement in the memory module, compared with page granularity cross channel or cross-network copying that is currently used. For a 1-second checkpoint commit interval, we demonstrate up to 20x checkpoint performance and 70x energy savings using IMV versus page copy-on-write (COW).","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 1","pages":"65-68"},"PeriodicalIF":2.3,"publicationDate":"2023-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49519371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Energy-Efficient Bayesian Inference Using Bitstream Computing","authors":"Soroosh Khoram;Kyle Daruwalla;Mikko Lipasti","doi":"10.1109/LCA.2023.3238584","DOIUrl":"10.1109/LCA.2023.3238584","url":null,"abstract":"Uncertainty quantification is critical to many machine learning applications especially in mobile and edge computing tasks like self-driving cars, robots, and mobile devices. Bayesian Neural Networks can be used to provide these uncertainty quantifications but they come at extra computation costs. However, power and energy can be limited at the edge. In this work, we propose using stochastic bitstream computing substrates for deploying BNNs which can significantly reduce power and costs. We design our Bayesian Bitstream Processor hardware for an audio classification task as a test case and show that it can outperform a micro-controller baseline in energy by two orders of magnitude and delay by an order of magnitude, at lower power.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 1","pages":"37-40"},"PeriodicalIF":2.3,"publicationDate":"2023-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47107644","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Intelligent SSD Firmware for Zero-Overhead Journaling","authors":"Hanyeoreum Bae;Donghyun Gouk;Seungjun Lee;Jiseon Kim;Sungjoon Koh;Jie Zhang;Myoungsoo Jung","doi":"10.1109/LCA.2023.3243695","DOIUrl":"10.1109/LCA.2023.3243695","url":null,"abstract":"We propose Check0-SSD, an intelligent SSD firmware to offer the best system-level fault-tolerance without performance degradation and lifetime shortening. Specifically, the SSD firmware autonomously removes transaction checkpointing, which eliminates redundant writes to the flash backend. To this end, Check0-SSD dynamically classifies journal descriptor/commit requests at runtime and switches the address spaces between journal and data regions by examining the host's filesystem layout and journal region information in a self-governing manner. Our evaluations demonstrate that Check0-SSD can protect both data and metadata with 89% enhanced storage lifetime while exhibiting similar or even better performance compared to the norecovery SSD.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 1","pages":"25-28"},"PeriodicalIF":2.3,"publicationDate":"2023-02-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49254659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Last-Level Cache Insertion and Promotion Policy in the Presence of Aggressive Prefetching","authors":"Daniel A. Jiménez;Elvira Teran;Paul V. Gratz","doi":"10.1109/LCA.2023.3242178","DOIUrl":"10.1109/LCA.2023.3242178","url":null,"abstract":"The last-level cache (LLC) is the last chance for memory accesses from the processor to avoid the costly latency of going to main memory. LLC management has been the topic of intense research focusing on two main techniques: replacement and prefetching. However, these two ideas are often evaluated separately, with one being studied outside the context of the state-of-the-art in the other. We find that high-performance replacement and highly accurate pattern-based prefetching do not result in synergistic improvements in performance. The overhead of complex replacement policies is wasted in the presence of aggressive prefetchers. We find that a simple replacement policy with minimal overhead provides at least the same benefit as a state-of-the-art replacement policy in the presence of aggressive pattern-based prefetching. Our proposal is based on the idea of using a genetic algorithm to search the space of insertion and promotion policies that generalize transitions in the recency stack for the least-recently-used policy.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 1","pages":"17-20"},"PeriodicalIF":2.3,"publicationDate":"2023-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45606044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ADT: Aggressive Demotion and Promotion for Tiered Memory","authors":"Yaebin Moon;Wanju Doh;Kwanhee Kyung;Eojin Lee;Jung Ho Ahn","doi":"10.1109/LCA.2023.3236685","DOIUrl":"10.1109/LCA.2023.3236685","url":null,"abstract":"Tiered memory using DRAM as upper-tier (fast memory) and emerging slower-but-larger byte-addressable memory as lower-tier (slow memory) is a promising approach to expanding main-memory capacity. Based on the observation that there are many cold pages in data-center applications, \u0000<italic>proactive demotion</i>\u0000 schemes demote cold pages to slow memory even when free space in fast memory is not deficient. Prior works on proactive demotion lower the requirement of expensive fast-memory capacity by reducing applications’ resident set size in fast memory. Also, some of the prior works mitigate the massive performance drop due to insufficient fast-memory capacity when there is a spike in demand for hot data. However, there is room for further improvement to save a larger fast-memory capacity with further aggressive demotion, which can fully reap the aforementioned advantages of proactive demotion. In this paper, we propose a new proactive demotion scheme, ADT, which performs \u0000<bold>a</b>\u0000ggressive \u0000<bold>d</b>\u0000emotion and promotion for \u0000<bold>t</b>\u0000iered memory. Using the memory access locality within the unit in which applications and memory allocators allocate memory, ADT extends the unit of demotion/promotion from the page adopted by prior works to make its demotion more aggressive. By performing demotion and promotion by the extended unit, ADT reduces 29% of fast-memory usage with only a 2.3% performance drop. Also, it achieves 2.28× speedup compared to the default Linux kernel when the system's memory usage is larger than fast-memory capacity, which outperforms state-of-the-art schemes for tiered memory management.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 1","pages":"21-24"},"PeriodicalIF":2.3,"publicationDate":"2023-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43125075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HAMMER: Hardware-Friendly Approximate Computing for Self-Attention With Mean-Redistribution And Linearization","authors":"Seonho Lee;Ranggi Hwang;Jongse Park;Minsoo Rhu","doi":"10.1109/LCA.2022.3233832","DOIUrl":"10.1109/LCA.2022.3233832","url":null,"abstract":"The recent advancement of the natural language processing (NLP) models is the result of the ever-increasing model size and datasets. Most of these modern NLP models adopt the Transformer based model architecture, whose main bottleneck is exhibited in the self-attention mechanism. As the computation required for self-attention increases rapidly as the model size gets larger, self-attentions have been the main challenge for deploying NLP models. Consequently, there are several prior works which sought to address this bottleneck, but most of them suffer from significant design overheads and additional training requirements. In this work, we propose HAMMER, hardware-friendly approximate computing solution for self-attentions employing mean-redistribution and linearization, which effectively increases the performance of self-attention mechanism with low overheads. Compared to previous state-of-the-art self-attention accelerators, HAMMER improves performance by \u0000<inline-formula><tex-math>$1.2-1.6times$</tex-math></inline-formula>\u0000 and energy efficiency by \u0000<inline-formula><tex-math>$1.2-1.5times$</tex-math></inline-formula>\u0000.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 1","pages":"13-16"},"PeriodicalIF":2.3,"publicationDate":"2023-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47478009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Advancing Compilation of DNNs for FPGAs Using Operation Set Architectures","authors":"Burkhard Ringlein;Francois Abel;Dionysios Diamantopoulos;Beat Weiss;Christoph Hagleitner;Dietmar Fey","doi":"10.1109/LCA.2022.3227643","DOIUrl":"10.1109/LCA.2022.3227643","url":null,"abstract":"The slow-down of technology scaling combined with the exponential growth of modern machine learning and artificial intelligence models has created a demand for specialized accelerators, such as GPUs, ASICs, and field-programmable gate arrays (FPGAs). FPGAs can be reconfigured and have the potential to outperform other accelerators, while also being more energy-efficient, but are cumbersome to use with today's fractured landscape of tool flows. We propose the concept of an operation set architecture to overcome the current incompatibilities and hurdles in using DNN-to-FPGA compilers by combining existing specialized frameworks into one organic compiler that also allows the efficient and automatic re-use of existing community tools. Furthermore, we demonstrate that mixing different existing frameworks can increase the efficiency by more than an order of magnitude.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 1","pages":"9-12"},"PeriodicalIF":2.3,"publicationDate":"2022-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41745751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CoreNap: Energy Efficient Core Allocation for Latency-Critical Workloads","authors":"Gyeongseo Park;Ki-Dong Kang;Minho Kim;Daehoon Kim","doi":"10.1109/LCA.2022.3227629","DOIUrl":"10.1109/LCA.2022.3227629","url":null,"abstract":"In data-center servers, the dynamic core allocation for Latency-Critical (LC) applications can play a crucial role in improving energy efficiency under Service Level Objective (SLO) constraints, allowing cores to enter idle states (i.e., C-states) that consume less power by turning off a part of hardware components of a processor. However, prior studies focus on the core allocation for application threads while not considering cores involved in network packet processing, even though packet processing affects not only response latency but also energy consumption considerably. In this paper, we first investigate the impacts of the explicit core allocation for network packet processing on the tail response latency and energy consumption while running LC applications. We observe that co-adjusting the number of cores for network packet processing along with the number of cores for LC application threads can improve energy efficiency substantially, compared with adjusting the number of cores only for application threads, as prior studies do. In addition, we propose a dynamic core allocation, called \u0000<monospace>CoreNap</monospace>\u0000, which allocates/de-allocates cores for both LC application threads and packet processing. \u0000<monospace>CoreNap</monospace>\u0000 measures the CPU-utilization by application threads and packet processing individually, and predicts response latency and power consumption when the combination of core allocation is enforced via a lightweight prediction model. Based on the prediction, \u0000<monospace>CoreNap</monospace>\u0000 chooses/enforces the energy-efficient combination of core allocation. Our experimental results show that \u0000<monospace>CoreNap</monospace>\u0000 reduces energy consumption by up to 18.6% compared with state-of-the-art study that adjusts cores only for LC application in parallel packet processing environments.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 1","pages":"1-4"},"PeriodicalIF":2.3,"publicationDate":"2022-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42616181","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}