{"title":"BCoal: Bucketing-Based Memory Coalescing for Efficient and Secure GPUs","authors":"Gurunath Kadam, Danfeng Zhang, Adwait Jog","doi":"10.1109/HPCA47549.2020.00053","DOIUrl":"https://doi.org/10.1109/HPCA47549.2020.00053","url":null,"abstract":"Graphics Processing Units (GPUs) are becoming a de facto choice for accelerating applications from a wide range of domains ranging from graphics to high-performance computing. As a result, it is getting increasingly desirable to improve the cooperation between traditional CPUs and accelerators such as GPUs. However, given the growing security concerns in the CPU space, closer integration of GPUs has further expanded the attack surface. For example, several side-channel attacks have shown that sensitive information can be leaked from the CPU end. In the same vein, several side-channel attacks are also now being developed in the GPU world. Overall, it is challenging to keep emerging CPU-GPU heterogeneous systems secure while maintaining their performance and energy efficiency. In this paper, we focus on developing an efficient defense mechanism for a type of correlation timing attack on GPUs. Such an attack has been shown to recover AES private keys by exploiting the relationship between the number of coalesced memory accesses and total execution time. Prior state-of-the-art defense mechanisms use inefficient randomized coalescing techniques to defend against such GPU attacks and require turning-off bandwidth conserving techniques such as caches and miss-status holding registers (MSHRs) to ensure security. To address these limitations, we propose BCoal – a new bucketing-based coalescing mechanism. BCoal significantly reduces the information leakage by always issuing pre-determined numbers of coalesced accesses (called buckets). With the help of a detailed application-level analysis, BCoal determines the bucket sizes and pads, if necessary, the number of real accesses with additional (padded) accesses to meet the bucket sizes ensuring the security against the correlation timing attack. Furthermore, BCoal generates the padded accesses such that the security is ensured even in the presence of MSHRs and caches. In effect, BCoal significantly improves GPU security at a modest performance loss.","PeriodicalId":339648,"journal":{"name":"2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131220564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tensaurus: A Versatile Accelerator for Mixed Sparse-Dense Tensor Computations","authors":"Nitish Srivastava, Hanchen Jin, Shaden Smith, Hongbo Rong, D. Albonesi, Zhiru Zhang","doi":"10.1109/HPCA47549.2020.00062","DOIUrl":"https://doi.org/10.1109/HPCA47549.2020.00062","url":null,"abstract":"Tensor factorizations are powerful tools in many machine learning and data analytics applications. Tensors are often sparse, which makes sparse tensor factorizations memory bound. In this work, we propose a hardware accelerator that can accelerate both dense and sparse tensor factorizations. We co-design the hardware and a sparse storage format, which allows accessing the sparse data in vectorized and streaming fashion and maximizes the utilization of the memory bandwidth. We extract a common computation pattern that is found in numerous matrix and tensor operations and implement it in the hardware. By designing the hardware based on this common compute pattern, we can not only accelerate tensor factorizations but also mixed sparse-dense matrix operations. We show significant speedup and energy benefit over the state-of-the-art CPU and GPU implementations of tensor factorizations and over CPU, GPU and accelerators for matrix operations.","PeriodicalId":339648,"journal":{"name":"2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125584760","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Impala: Algorithm/Architecture Co-Design for In-Memory Multi-Stride Pattern Matching","authors":"Elaheh Sadredini, Reza Rahimi, Marzieh Lenjani, M. Stan, K. Skadron","doi":"10.1109/HPCA47549.2020.00017","DOIUrl":"https://doi.org/10.1109/HPCA47549.2020.00017","url":null,"abstract":"High-throughput and concurrent processing of thousands of patterns on each byte of an input stream is critical for many applications with real-time processing needs, such as network intrusion detection, spam filters, virus scanners, and many more. The demand for accelerated pattern matching has motivated several recent in-memory accelerator architectures for automata processing, which is an efficient computation model for pattern matching. Our key observations are: (1) all these architectures are based on 8-bit symbol processing (derived from ASCII), and our analysis on a large set of real-world automata benchmarks reveals that the 8-bit processing dramatically underutilizes hardware resources, and (2) multi-stride symbol processing, a major source of throughput growth, is not explored in the existing in-memory solutions. This paper presents Impala, a multi-stride in-memory automata processing architecture by leveraging our observations. The key insight of our work is that transforming 8-bit processing to 4-bit processing exponentially reduces hardware resources for state-matching and improves resource utilization. This, in turn, brings the opportunity to have a denser design, and be able to utilize more memory columns to process multiple symbols per cycle with a linear increase in state-matching resources. Impala thus introduces three-fold area, throughput, and energy benefits at the expense of increased offline compilation time. Our empirical evaluations on a wide range of automata benchmarks reveal that Impala has on average 2.7X (up to 3.7X) higher throughput per unit area and 1.22X lower power consumption than Cache Automaton, which is the best performing prior work.","PeriodicalId":339648,"journal":{"name":"2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122541032","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Twig: Multi-Agent Task Management for Colocated Latency-Critical Cloud Services","authors":"Rajiv Nishtala, V. Petrucci, P. Carpenter, Magnus Själander","doi":"10.1109/HPCA47549.2020.00023","DOIUrl":"https://doi.org/10.1109/HPCA47549.2020.00023","url":null,"abstract":"Many of the important services running on data centres are latency-critical, time-varying, and demand strict user satisfaction. Stringent tail-latency targets for colocated services and increasing system complexity make it challenging to reduce the power consumption of data centres. Data centres typically sacrifice server efficiency to maintain tail-latency targets resulting in an increased total cost of ownership. This paper introduces Twig, a scalable quality-of-service (QoS) aware task manager for latency-critical services co-located on a server system. Twig successfully leverages deep reinforcement learning to characterise tail latency using hardware performance counters and to drive energy-efficient task management decisions in data centres. We evaluate Twig on a typical data centre server managing four widely used latency-critical services. Our results show that Twig outperforms prior works in reducing energy usage by up to 38% while achieving up to 99% QoS guarantee for latency-critical services.","PeriodicalId":339648,"journal":{"name":"2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124745860","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Charge-Aware DRAM Refresh Reduction with Value Transformation","authors":"Seikwon Kim, Wonsang Kwak, Changdae Kim, Daehyeon Baek, Jaehyuk Huh","doi":"10.1109/HPCA47549.2020.00060","DOIUrl":"https://doi.org/10.1109/HPCA47549.2020.00060","url":null,"abstract":"As the memory capacity in a system has been growing, refresh operations consume increasing ratios of the total DRAM power. To reduce the power consumption of such refresh operations, this paper proposes a novel value-aware refresh reduction technique called ZERO - REFRESH which exploits zero values in memory contents. A DRAM cell can retain the discharged state without refresh operations, and ZERO - REFRESH skips refresh operations on rows with all discharged cells. For abundant unallocated memory pages in typical systems, the operating system fills them with zeros to clean the contents. For those idle pages, ZERO - REFRESH can eliminate refresh operations in an OS-transparent way without any new interface to DRAM. However, for allocated memory pages, memory contents may not have many consecutive zero values to match the refresh granularity of DRAM. To increase the frequency of zero values and to arrange them to match the refresh granularity, ZERO - REFRESH transforms the value of memory blocks to the base and delta values, inspired by the prior BDI (Base-Delta-Immediate) compression technique. Once values are converted, bits are transposed to be stored as consecutive discharged bits at the refresh granularity. Such value transformation and rearrangement can make the memory contents friendly to refresh reduction based on discharged cells. The experimental results based on simulation show that the DRAM refresh operations are reduced by 37% on average for a set of benchmark applications, if the entire memory is allocated for the applications. If the memory usage statistics collected from three data center traces are applied, the DRAM refresh operations can be reduced by 46%, 57%, and 83% respectively for the three scenarios.","PeriodicalId":339648,"journal":{"name":"2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121405093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep Learning Acceleration with Neuron-to-Memory Transformation","authors":"M. Imani, Mohammad Samragh Razlighi, Yeseong Kim, Saransh Gupta, F. Koushanfar, T. Simunic","doi":"10.1109/HPCA47549.2020.00011","DOIUrl":"https://doi.org/10.1109/HPCA47549.2020.00011","url":null,"abstract":"Deep neural networks (DNN) have demonstrated effectiveness for various applications such as image processing, video segmentation, and speech recognition. Running state-of-theart DNNs on current systems mostly relies on either generalpurpose processors, ASIC designs, or FPGA accelerators, all of which suffer from data movements due to the limited on-chip memory and data transfer bandwidth. In this work, we propose a novel framework, called RAPIDNN, which performs neuron-to-memory transformation in order to accelerate DNNs in a highly parallel architecture. RAPIDNN reinterprets a DNN model and maps it into a specialized accelerator, which is designed using non-volatile memory blocks that model four fundamental DNN operations, i.e., multiplication, addition, activation functions, and pooling. The framework extracts representative operands of a DNN model, e.g., weights and input values, using clustering methods to optimize the model for in-memory processing. Then, it maps the extracted operands and their pre-computed results into the accelerator memory blocks. At runtime, the accelerator identifies computation results based on efficient in-memory search capability which also provides tunability of approximation to improve computation efficiency further. Our evaluation shows that RAPIDNN achieves 68.4×, 49.5× energy efficiency improvement and 48.1×, 10.9× speedup as compared to ISAAC and PipeLayer, the state-of-the-art DNN accelerators, while ensuring less than 0.5% quality loss.","PeriodicalId":339648,"journal":{"name":"2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115790210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EquiNox: Equivalent NoC Injection Routers for Silicon Interposer-Based Throughput Processors","authors":"Yunfan Li, Lizhong Chen","doi":"10.1109/HPCA47549.2020.00043","DOIUrl":"https://doi.org/10.1109/HPCA47549.2020.00043","url":null,"abstract":"Throughput-oriented many-core processors demand highly efficient network-on-chip (NoC) architecture for data transferring. Recent advent of silicon interposer, stacked memory and 2.5D integration have further increased data transfer rate. This greatly intensifies traffic bottleneck in the NoC but, at the same time, also brings a significant new opportunity in utilizing wiring resources in the interposer. In this paper, we propose a novel concept called Equivalent Injection Routers (EIRs) which, together with interposer links, transform the few-to-many traffic pattern to many-to-many pattern, thus fundamentally solving the bottleneck problem. We have developed EquiNox as a design example. We utilize N-Queen and Monte Carlo Tree Search (MCTS) methods to help select EIRs by considering comprehensively from topological, architectural and physical aspects. Evaluation results show that, compared with prior work, the proposed EquiNox is able to reduce execution time by 23.5%, energy consumption by 18.9%, and EDP by 32.8%, under similar hardware cost.","PeriodicalId":339648,"journal":{"name":"2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133386842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HMG: Extending Cache Coherence Protocols Across Modern Hierarchical Multi-GPU Systems","authors":"X. Ren, Daniel Lustig, Evgeny Bolotin, A. Jaleel, Oreste Villa, D. Nellans","doi":"10.1109/HPCA47549.2020.00054","DOIUrl":"https://doi.org/10.1109/HPCA47549.2020.00054","url":null,"abstract":"Prior work on GPU cache coherence has shown that simple hardware-or software-based protocols can be more than sufficient. However, in recent years, features such as multi-chip modules have added deeper hierarchy and non-uniformity into GPU memory systems. GPU programming models have chosen to expose this non-uniformity directly to the end user through scoped memory consistency models. As a result, there is room to improve upon earlier coherence protocols that were designed only for flat single-GPU hierarchies and/or simpler memory consistency models. In this paper, we propose HMG, a cache coherence protocol designed for forward-looking multi-GPU systems. HMG strikes a balance between simplicity and performance: it uses a readily-implementable VI-like protocol to track coherence states, but it tracks sharers using a hierarchical scheme optimized for mitigating the bandwidth limitations of inter-GPU links. HMG leverages the novel scoped, non-multi-copy-atomic properties of modern GPU memory models, and it avoids the overheads of invalidation acknowledgments and transient states that were needed to support prior GPU memory models. On a 4-GPU system, HMG improves performance over a software-controlled, bulk invalidation-based coherence mechanism by 26% and over a non-hierarchical hardware cache coherence protocol by 18%, thereby achieving 97% of the performance of an idealized caching system.","PeriodicalId":339648,"journal":{"name":"2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123256040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ACR: Amnesic Checkpointing and Recovery","authors":"Ismail Akturk, Ulya R. Karpuzcu","doi":"10.1109/HPCA47549.2020.00013","DOIUrl":"https://doi.org/10.1109/HPCA47549.2020.00013","url":null,"abstract":"Systematic checkpointing of the machine state makes restart of execution from a safe state possible upon detection of an error. The time and energy overhead of checkpointing, however, grows with the frequency of checkpointing. Considering the growth of expected error rates, amortizing this overhead becomes especially challenging, as checkpointing frequency tends to increase with increasing error rates. Based on the observation that due to imbalanced technology scaling, recomputing a data value can be more energy efficient than retrieving (i.e., loading) a stored copy, this paper explores how recomputation of data values (which otherwise would be read from a checkpoint from memory or secondary storage) can reduce the machine state to be checkpointed, and thereby, the checkpointing overhead. Even in a relatively small scale system, recomputation-based checkpointing can reduce the storage overhead by up to 23.91%; time overhead, by 11.92%; and energy overhead, by 12.53%, respectively.","PeriodicalId":339648,"journal":{"name":"2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126354202","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DRAIN: Deadlock Removal for Arbitrary Irregular Networks","authors":"Mayank Parasar, Hossein Farrokhbakht, Natalie D. Enright Jerger, Paul V. Gratz, T. Krishna, Joshua San Miguel","doi":"10.1109/HPCA47549.2020.00044","DOIUrl":"https://doi.org/10.1109/HPCA47549.2020.00044","url":null,"abstract":"Correctness is a first-order concern in the design of computer systems. For multiprocessors, a primary correctness concern is the deadlock-free operation of the network and its coherence protocol; furthermore, we must guarantee the continued correctness of the network in the face of increasing faults. Designing for deadlock freedom is expensive. Prior solutions either sacrifice performance or power efficiency to proactively avoid deadlocks or impose high hardware complexity to reactively resolve deadlocks as they occur. However, the precise confluence of events that lead to deadlocks is so rare that minimal resources and time should be spent to ensure deadlock freedom. To that end, we propose DRAIN, a subactive approach to remove potential deadlocks without needing to explicitly detect or avoid them. We simply let deadlocks happen and periodically drain (i.e., force the movement of) packets in the network that may be involved in a cyclic dependency. As deadlocks are a rare occurrence, draining can be performed infrequently and at low cost. Unlike prior solutions, DRAIN eliminates not only routing-level but also protocol-level deadlocks without the need for expensive virtual networks. DRAIN dramatically simplifies deadlock freedom for irregular topologies and networks that are prone to wear-related faults. Our evaluations show that on an average, DRAIN can save 26.73% packet latency compared to proactive deadlock-freedom schemes in the presence of faults while saving 77.6% power compared to reactive schemes.","PeriodicalId":339648,"journal":{"name":"2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131345217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}