{"title":"NoHammer: Preventing Row Hammer With Last-Level Cache Management","authors":"Seunghak Lee;Ki-Dong Kang;Gyeongseo Park;Nam Sung Kim;Daehoon Kim","doi":"10.1109/LCA.2023.3320670","DOIUrl":"https://doi.org/10.1109/LCA.2023.3320670","url":null,"abstract":"Row Hammer (RH) is a circuit-level phenomenon where repetitive activation of a DRAM row causes bit-flips in adjacent rows. Prior studies that rely on extra refreshes to mitigate RH vulnerability demonstrate that bit-flips can be prevented effectively. However, its implementation is challenging due to the significant performance degradation and energy overhead caused by the additional extra refresh for the RH mitigation. To overcome challenges, some studies propose techniques to mitigate the RH attack without relying on extra refresh. These techniques include delaying the activation of an aggressor row for a certain amount of time or swapping an aggressor row with another row to isolate it from victim rows. Although such techniques do not require extra refreshes to mitigate RH, the activation delaying technique may result in high-performance degradation in false-positive cases, and the swapping technique requires high storage overheads to track swap information. We propose \u0000<monospace>NoHammer</monospace>\u0000, an efficient RH mitigation technique to prevent the bit-flips caused by the RH attack by utilizing Last-Level Cache (LLC) management. \u0000<monospace>NoHammer</monospace>\u0000 temporarily extends the associativity of the cache set that is being targeted by utilizing another cache set as the extended set and keeps the cache lines of aggressor rows on the extended set under the eviction-based RH attack. Along with the modification of the LLC replacement policy, \u0000<monospace>NoHammer</monospace>\u0000 ensures that the aggressor row's cache lines are not evicted from the LLC under the RH attack. In our evaluation, we demonstrate that \u0000<monospace>NoHammer</monospace>\u0000 gives 6% higher performance than a baseline without any RH mitigation technique by replacing excessive cache misses caused by the RH attack with LLC hits through sophisticated LLC management, while requiring 45% less storage than prior proposals.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 2","pages":"157-160"},"PeriodicalIF":2.3,"publicationDate":"2023-09-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49962232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hungarian Qubit Assignment for Optimized Mapping of Quantum Circuits on Multi-Core Architectures","authors":"Pau Escofet;Anabel Ovide;Carmen G. Almudever;Eduard Alarcón;Sergi Abadal","doi":"10.1109/LCA.2023.3318857","DOIUrl":"https://doi.org/10.1109/LCA.2023.3318857","url":null,"abstract":"Modular quantum computing architectures offer a promising alternative to monolithic designs for overcoming the scaling limitations of current quantum computers. To achieve scalability beyond small prototypes, quantum architectures are expected to adopt a modular approach, featuring clusters of tightly connected quantum bits with sparser connections between these clusters. Efficiently distributing qubits across multiple processing cores is critical for improving quantum computing systems’ performance and scalability. To address this challenge, we propose the Hungarian Qubit Assignment (HQA) algorithm, which leverages the Hungarian algorithm to improve qubit-to-core assignment. The HQA algorithm considers the interactions between qubits over the entire circuit, enabling fine-grained partitioning and enhanced qubit utilization. We compare the HQA algorithm with state-of-the-art alternatives through comprehensive experiments using both real-world quantum algorithms and random quantum circuits. The results demonstrate the superiority of our proposed approach, outperforming existing methods, with an average improvement of 1.28×.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 2","pages":"161-164"},"PeriodicalIF":2.3,"publicationDate":"2023-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49993040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware-Assisted Code-Pointer Tagging for Forward-Edge Control-Flow Integrity","authors":"Yonghae Kim;Anurag Kar;Jaewon Lee;Jaekyu Lee;Hyesoon Kim","doi":"10.1109/LCA.2023.3306326","DOIUrl":"https://doi.org/10.1109/LCA.2023.3306326","url":null,"abstract":"Software attacks typically operate by overwriting control data, such as a return address and a function pointer, and hijacking the control flow of a program. To prevent such attacks, a number of control-flow integrity (CFI) solutions have been proposed. Nevertheless, most prior work finds difficulties in serving two ends: performance and security. In particular, protecting forward edges, i.e., indirect calls, remains challenging to solve without trading off one for another. In this work, we propose Code-Pointer Tagging (CPT), a novel dynamic CFI solution combined with cryptographic protection. Our key observation is that a pointer's message authentication code (MAC) can be associated with the pointer's CFI label used for CFI checks. We find that such an approach not only enables a space-efficient control-flow graph (CFG) storage but also achieves highly-efficient CFI checks performed along with implicit pointer authentication. To enable CPT, we implement lightweight compiler and hardware support. We prototype our design in an FPGA-accelerated RISC-V hardware simulation platform and conduct full-system-level evaluations. Our results show that CPT incurs a 1.2% average slowdown on the SPEC CPU C/C++ benchmarks while providing effective layered hardening on forward-edge CFI.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 2","pages":"117-120"},"PeriodicalIF":2.3,"publicationDate":"2023-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49988597","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast Performance Prediction for Efficient Distributed DNN Training","authors":"Yugyoung Yun;Eunhyeok Park","doi":"10.1109/LCA.2023.3316452","DOIUrl":"https://doi.org/10.1109/LCA.2023.3316452","url":null,"abstract":"Training large-scale DNN models requires parallel distributed training using hyper-scale systems. To make the best use of the numerous accelerators, it is essential to intelligently combine different parallelization schemes. However, as the size of DNN models increases, the possible combinations of schemes become enormous, and consequently, finding the optimal parallel plan becomes exceedingly expensive and practically unfeasible. In this letter, we introduce a novel cost model, the Markovian Performance Estimator (MPE). This model provides affordable estimates of the throughput of various parallel plans, promoting efficient and fast searches for the ideal parallel plan, even when resources are limited. Significantly, this work is pioneering in explaining the expensive nature of searching for an optimal plan and addressing it using intuitive performance estimations based on real device evaluations. Our experiments demonstrate the effectiveness of the MPE, revealing that it accelerates the optimization process up to 126x faster (36.4 on average) than the existing state-of-the-art baseline, Alpa.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 2","pages":"133-136"},"PeriodicalIF":2.3,"publicationDate":"2023-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49993041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Balancing Performance Against Cost and Sustainability in Multi-Chip-Module GPUs","authors":"Shiqing Zhang;Mahmood Naderan-Tahan;Magnus Jahre;Lieven Eeckhout","doi":"10.1109/LCA.2023.3313203","DOIUrl":"https://doi.org/10.1109/LCA.2023.3313203","url":null,"abstract":"MCM-GPUs scale performance by integrating multiple chiplets within the same package. How to partition the aggregate compute resources across chiplets poses a fundamental trade-off in performance versus cost and sustainability. We propose the \u0000<italic>Performance Per Wafer (PPW)</i>\u0000 metric to explore this trade-off and we find that while performance is maximized with few large chiplets, and while cost and environmental footprint is minimized with many small chiplets, the optimum balance is achieved with a moderate number of medium-sized chiplets. The optimum number of chiplets depends on the workload and increases with increased inter-chiplet bandwidth.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 2","pages":"145-148"},"PeriodicalIF":2.3,"publicationDate":"2023-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49962231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LV: Latency-Versatile Floating-Point Engine for High-Performance Deep Neural Networks","authors":"Yun-Chen Lo;Yu-Chih Tsai;Ren-Shuo Liu","doi":"10.1109/LCA.2023.3287096","DOIUrl":"10.1109/LCA.2023.3287096","url":null,"abstract":"Computing latency is an important system metric for Deep Neural Networks (DNNs) accelerators. To reduce latency, this work proposes \u0000<bold>LV</b>\u0000, a latency-versatile floating-point engine (FP-PE), which contains the following key contributions: 1) an approximate bit-versatile multiplier-and-accumulate (BV-MAC) unit with early shifter and 2) an on-demand fixed-point-to-floating-point conversion (FXP2FP) unit. The extensive experimental results show that LV outperforms baseline FP-PE and redundancy-aware FP-PE by up to 2.12× and 1.3× speedup using TSMC 40-nm technology, achieving comparable accuracy on the ImageNet classification tasks.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 2","pages":"125-128"},"PeriodicalIF":2.3,"publicationDate":"2023-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44362022","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Flexible Embedding-Aware Near Memory Processing Architecture for Recommendation System","authors":"Lingfei Lu;Yudi Qiu;Shiyan Yi;Yibo Fan","doi":"10.1109/LCA.2023.3305668","DOIUrl":"10.1109/LCA.2023.3305668","url":null,"abstract":"Personalized recommendation system (RS) is widely used in the industrial community and occupies much time in AI computing centers. A critical component of RS is the embedding layer, which consists of sparse embedding lookups and is memory-bounded. Recent works have proposed near-memory processing (NMP) architectures to utilize high inner-memory bandwidth to speed up embedding lookups. These NMP works divide embedding vectors either horizontally or vertically. However, the effectiveness of horizontal or vertical partitioning is hard to guarantee under different memory configurations or embedding vector sizes. To improve this issue, we propose FeaNMP, a \u0000<underline>f</u>\u0000lexible \u0000<underline>e</u>\u0000mbedding-\u0000<underline>a</u>\u0000ware \u0000<underline>NMP</u>\u0000 architecture that accelerates the inference phase of RS. We explore different partitioning strategies in detail and design a flexible way to select optimal ones depending on different embedding dimensions and DDR configurations. As a result, compared to the state-of-the-art rank-level NMP work RecNMP, our work achieves up to 11.1× speedup for embedding layers under mix-dimensioned workloads.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 2","pages":"165-168"},"PeriodicalIF":2.3,"publicationDate":"2023-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136139267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unleashing the Potential of PIM: Accelerating Large Batched Inference of Transformer-Based Generative Models","authors":"Jaewan Choi;Jaehyun Park;Kwanhee Kyung;Nam Sung Kim;Jung Ho Ahn","doi":"10.1109/LCA.2023.3305386","DOIUrl":"10.1109/LCA.2023.3305386","url":null,"abstract":"Transformer-based generative models, such as GPT, summarize an input sequence by generating key/value (KV) matrices through attention and generate the corresponding output sequence by utilizing these matrices once per token of the sequence. Both input and output sequences tend to get longer, which improves the understanding of contexts and conversation quality. These models are also typically batched for inference to improve the serving throughput. All these trends enable the models’ weights to be reused effectively, increasing the relative importance of sequence generation, especially in processing KV matrices through attention. We identify that the conventional computing platforms (e.g., GPUs) are not efficient at handling this attention part for inference because each request generates different KV matrices, it has a low operation per byte ratio regardless of the batch size, and the aggregate size of the KV matrices can even surpass that of the entire model weights. This motivates us to propose AttAcc, which exploits the fact that the KV matrices are written once during summarization but used many times (proportional to the output sequence length), each multiplied by the embedding vector corresponding to an output token. The volume of data entering/leaving AttAcc could be more than orders of magnitude smaller than what should be read internally for attention. We design AttAcc with multiple processing-in-memory devices, each multiplying the embedding vector with the portion of the KV matrices within the devices, saving external (inter-device) bandwidth and energy consumption.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 2","pages":"113-116"},"PeriodicalIF":2.3,"publicationDate":"2023-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/10208/10189818/10218731.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49570973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Characterizing and Understanding Defense Methods for GNNs on GPUs","authors":"Meng Wu;Mingyu Yan;Xiaocheng Yang;Wenming Li;Zhimin Zhang;Xiaochun Ye;Dongrui Fan","doi":"10.1109/LCA.2023.3304638","DOIUrl":"10.1109/LCA.2023.3304638","url":null,"abstract":"Graph neural networks (GNNs) are widely deployed in many vital fields, but suffer from adversarial attacks, which seriously compromise the security in these fields. Plenty of defense methods have been proposed to mitigate the impact of these attacks, however, they have introduced extra time-consuming stages into the execution of GNNs. These extra stages need to be accelerated because the end-to-end acceleration is essential for GNNs to achieve fast development and deployment. To disclose the performance bottlenecks, execution patterns, execution semantics, and overheads of the defense methods for GNNs, we characterize and explore these extra stages on GPUs. Given the characterization and exploration, we provide several useful guidelines for both software and hardware optimizations to accelerate the defense methods for GNNs.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 2","pages":"137-140"},"PeriodicalIF":2.3,"publicationDate":"2023-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44243157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"By-Software Branch Prediction in Loops","authors":"Maziar Goudarzi;Reza Azimi;Julian Humecki;Faizaan Rehman;Richard Zhang;Chirag Sethi;Tanishq Bomman;Yuqi Yang","doi":"10.1109/LCA.2023.3304613","DOIUrl":"https://doi.org/10.1109/LCA.2023.3304613","url":null,"abstract":"Load-Dependent Branches (LDB) often do not exhibit regular patterns in their local or global history and thus are inherently hard to predict correctly by conventional branch predictors. We propose a software-to-hardware branch pre-resolution mechanism that allows software to pass branch outcomes to the processor frontend ahead of fetching the branch instruction. A compiler pass identifies the instruction chain leading to the branch (the branch \u0000<italic>backslice</i>\u0000) and generates the pre-execute code that produces the branch outcomes ahead of the frontend observing them. The loop structure helps to unambiguously map the branch outcomes to their corresponding dynamic instances of the branch instruction. Our approach also allows for covering the loop iteration space selectively, with arbitrarily complex patterns. Our method for pre-execution enables important optimizations such as unrolling and vectorization, in order to substantially reduce the pre-execution overhead. Experimental results on select workloads from SPEC CPU 2017 and graph analytics workloads show up to 95% reduction of MPKI (21% on average), up to 39% speedup (7% on average), and 23% IPC gain on average, compared to a core with TAGE-SC-L-64KB branch predictor.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"22 2","pages":"129-132"},"PeriodicalIF":2.3,"publicationDate":"2023-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49993042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}