{"title":"BCoal: Bucketing-Based Memory Coalescing for Efficient and Secure GPUs","authors":"Gurunath Kadam, Danfeng Zhang, Adwait Jog","doi":"10.1109/HPCA47549.2020.00053","DOIUrl":"https://doi.org/10.1109/HPCA47549.2020.00053","url":null,"abstract":"Graphics Processing Units (GPUs) are becoming a de facto choice for accelerating applications from a wide range of domains ranging from graphics to high-performance computing. As a result, it is getting increasingly desirable to improve the cooperation between traditional CPUs and accelerators such as GPUs. However, given the growing security concerns in the CPU space, closer integration of GPUs has further expanded the attack surface. For example, several side-channel attacks have shown that sensitive information can be leaked from the CPU end. In the same vein, several side-channel attacks are also now being developed in the GPU world. Overall, it is challenging to keep emerging CPU-GPU heterogeneous systems secure while maintaining their performance and energy efficiency. In this paper, we focus on developing an efficient defense mechanism for a type of correlation timing attack on GPUs. Such an attack has been shown to recover AES private keys by exploiting the relationship between the number of coalesced memory accesses and total execution time. Prior state-of-the-art defense mechanisms use inefficient randomized coalescing techniques to defend against such GPU attacks and require turning-off bandwidth conserving techniques such as caches and miss-status holding registers (MSHRs) to ensure security. To address these limitations, we propose BCoal – a new bucketing-based coalescing mechanism. BCoal significantly reduces the information leakage by always issuing pre-determined numbers of coalesced accesses (called buckets). With the help of a detailed application-level analysis, BCoal determines the bucket sizes and pads, if necessary, the number of real accesses with additional (padded) accesses to meet the bucket sizes ensuring the security against the correlation timing attack. Furthermore, BCoal generates the padded accesses such that the security is ensured even in the presence of MSHRs and caches. In effect, BCoal significantly improves GPU security at a modest performance loss.","PeriodicalId":339648,"journal":{"name":"2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131220564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tensaurus: A Versatile Accelerator for Mixed Sparse-Dense Tensor Computations","authors":"Nitish Srivastava, Hanchen Jin, Shaden Smith, Hongbo Rong, D. Albonesi, Zhiru Zhang","doi":"10.1109/HPCA47549.2020.00062","DOIUrl":"https://doi.org/10.1109/HPCA47549.2020.00062","url":null,"abstract":"Tensor factorizations are powerful tools in many machine learning and data analytics applications. Tensors are often sparse, which makes sparse tensor factorizations memory bound. In this work, we propose a hardware accelerator that can accelerate both dense and sparse tensor factorizations. We co-design the hardware and a sparse storage format, which allows accessing the sparse data in vectorized and streaming fashion and maximizes the utilization of the memory bandwidth. We extract a common computation pattern that is found in numerous matrix and tensor operations and implement it in the hardware. By designing the hardware based on this common compute pattern, we can not only accelerate tensor factorizations but also mixed sparse-dense matrix operations. We show significant speedup and energy benefit over the state-of-the-art CPU and GPU implementations of tensor factorizations and over CPU, GPU and accelerators for matrix operations.","PeriodicalId":339648,"journal":{"name":"2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125584760","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Impala: Algorithm/Architecture Co-Design for In-Memory Multi-Stride Pattern Matching","authors":"Elaheh Sadredini, Reza Rahimi, Marzieh Lenjani, M. Stan, K. Skadron","doi":"10.1109/HPCA47549.2020.00017","DOIUrl":"https://doi.org/10.1109/HPCA47549.2020.00017","url":null,"abstract":"High-throughput and concurrent processing of thousands of patterns on each byte of an input stream is critical for many applications with real-time processing needs, such as network intrusion detection, spam filters, virus scanners, and many more. The demand for accelerated pattern matching has motivated several recent in-memory accelerator architectures for automata processing, which is an efficient computation model for pattern matching. Our key observations are: (1) all these architectures are based on 8-bit symbol processing (derived from ASCII), and our analysis on a large set of real-world automata benchmarks reveals that the 8-bit processing dramatically underutilizes hardware resources, and (2) multi-stride symbol processing, a major source of throughput growth, is not explored in the existing in-memory solutions. This paper presents Impala, a multi-stride in-memory automata processing architecture by leveraging our observations. The key insight of our work is that transforming 8-bit processing to 4-bit processing exponentially reduces hardware resources for state-matching and improves resource utilization. This, in turn, brings the opportunity to have a denser design, and be able to utilize more memory columns to process multiple symbols per cycle with a linear increase in state-matching resources. Impala thus introduces three-fold area, throughput, and energy benefits at the expense of increased offline compilation time. Our empirical evaluations on a wide range of automata benchmarks reveal that Impala has on average 2.7X (up to 3.7X) higher throughput per unit area and 1.22X lower power consumption than Cache Automaton, which is the best performing prior work.","PeriodicalId":339648,"journal":{"name":"2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122541032","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Twig: Multi-Agent Task Management for Colocated Latency-Critical Cloud Services","authors":"Rajiv Nishtala, V. Petrucci, P. Carpenter, Magnus Själander","doi":"10.1109/HPCA47549.2020.00023","DOIUrl":"https://doi.org/10.1109/HPCA47549.2020.00023","url":null,"abstract":"Many of the important services running on data centres are latency-critical, time-varying, and demand strict user satisfaction. Stringent tail-latency targets for colocated services and increasing system complexity make it challenging to reduce the power consumption of data centres. Data centres typically sacrifice server efficiency to maintain tail-latency targets resulting in an increased total cost of ownership. This paper introduces Twig, a scalable quality-of-service (QoS) aware task manager for latency-critical services co-located on a server system. Twig successfully leverages deep reinforcement learning to characterise tail latency using hardware performance counters and to drive energy-efficient task management decisions in data centres. We evaluate Twig on a typical data centre server managing four widely used latency-critical services. Our results show that Twig outperforms prior works in reducing energy usage by up to 38% while achieving up to 99% QoS guarantee for latency-critical services.","PeriodicalId":339648,"journal":{"name":"2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124745860","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Charge-Aware DRAM Refresh Reduction with Value Transformation","authors":"Seikwon Kim, Wonsang Kwak, Changdae Kim, Daehyeon Baek, Jaehyuk Huh","doi":"10.1109/HPCA47549.2020.00060","DOIUrl":"https://doi.org/10.1109/HPCA47549.2020.00060","url":null,"abstract":"As the memory capacity in a system has been growing, refresh operations consume increasing ratios of the total DRAM power. To reduce the power consumption of such refresh operations, this paper proposes a novel value-aware refresh reduction technique called ZERO - REFRESH which exploits zero values in memory contents. A DRAM cell can retain the discharged state without refresh operations, and ZERO - REFRESH skips refresh operations on rows with all discharged cells. For abundant unallocated memory pages in typical systems, the operating system fills them with zeros to clean the contents. For those idle pages, ZERO - REFRESH can eliminate refresh operations in an OS-transparent way without any new interface to DRAM. However, for allocated memory pages, memory contents may not have many consecutive zero values to match the refresh granularity of DRAM. To increase the frequency of zero values and to arrange them to match the refresh granularity, ZERO - REFRESH transforms the value of memory blocks to the base and delta values, inspired by the prior BDI (Base-Delta-Immediate) compression technique. Once values are converted, bits are transposed to be stored as consecutive discharged bits at the refresh granularity. Such value transformation and rearrangement can make the memory contents friendly to refresh reduction based on discharged cells. The experimental results based on simulation show that the DRAM refresh operations are reduced by 37% on average for a set of benchmark applications, if the entire memory is allocated for the applications. If the memory usage statistics collected from three data center traces are applied, the DRAM refresh operations can be reduced by 46%, 57%, and 83% respectively for the three scenarios.","PeriodicalId":339648,"journal":{"name":"2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121405093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep Learning Acceleration with Neuron-to-Memory Transformation","authors":"M. Imani, Mohammad Samragh Razlighi, Yeseong Kim, Saransh Gupta, F. Koushanfar, T. Simunic","doi":"10.1109/HPCA47549.2020.00011","DOIUrl":"https://doi.org/10.1109/HPCA47549.2020.00011","url":null,"abstract":"Deep neural networks (DNN) have demonstrated effectiveness for various applications such as image processing, video segmentation, and speech recognition. Running state-of-theart DNNs on current systems mostly relies on either generalpurpose processors, ASIC designs, or FPGA accelerators, all of which suffer from data movements due to the limited on-chip memory and data transfer bandwidth. In this work, we propose a novel framework, called RAPIDNN, which performs neuron-to-memory transformation in order to accelerate DNNs in a highly parallel architecture. RAPIDNN reinterprets a DNN model and maps it into a specialized accelerator, which is designed using non-volatile memory blocks that model four fundamental DNN operations, i.e., multiplication, addition, activation functions, and pooling. The framework extracts representative operands of a DNN model, e.g., weights and input values, using clustering methods to optimize the model for in-memory processing. Then, it maps the extracted operands and their pre-computed results into the accelerator memory blocks. At runtime, the accelerator identifies computation results based on efficient in-memory search capability which also provides tunability of approximation to improve computation efficiency further. Our evaluation shows that RAPIDNN achieves 68.4×, 49.5× energy efficiency improvement and 48.1×, 10.9× speedup as compared to ISAAC and PipeLayer, the state-of-the-art DNN accelerators, while ensuring less than 0.5% quality loss.","PeriodicalId":339648,"journal":{"name":"2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115790210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EquiNox: Equivalent NoC Injection Routers for Silicon Interposer-Based Throughput Processors","authors":"Yunfan Li, Lizhong Chen","doi":"10.1109/HPCA47549.2020.00043","DOIUrl":"https://doi.org/10.1109/HPCA47549.2020.00043","url":null,"abstract":"Throughput-oriented many-core processors demand highly efficient network-on-chip (NoC) architecture for data transferring. Recent advent of silicon interposer, stacked memory and 2.5D integration have further increased data transfer rate. This greatly intensifies traffic bottleneck in the NoC but, at the same time, also brings a significant new opportunity in utilizing wiring resources in the interposer. In this paper, we propose a novel concept called Equivalent Injection Routers (EIRs) which, together with interposer links, transform the few-to-many traffic pattern to many-to-many pattern, thus fundamentally solving the bottleneck problem. We have developed EquiNox as a design example. We utilize N-Queen and Monte Carlo Tree Search (MCTS) methods to help select EIRs by considering comprehensively from topological, architectural and physical aspects. Evaluation results show that, compared with prior work, the proposed EquiNox is able to reduce execution time by 23.5%, energy consumption by 18.9%, and EDP by 32.8%, under similar hardware cost.","PeriodicalId":339648,"journal":{"name":"2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133386842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HMG: Extending Cache Coherence Protocols Across Modern Hierarchical Multi-GPU Systems","authors":"X. Ren, Daniel Lustig, Evgeny Bolotin, A. Jaleel, Oreste Villa, D. Nellans","doi":"10.1109/HPCA47549.2020.00054","DOIUrl":"https://doi.org/10.1109/HPCA47549.2020.00054","url":null,"abstract":"Prior work on GPU cache coherence has shown that simple hardware-or software-based protocols can be more than sufficient. However, in recent years, features such as multi-chip modules have added deeper hierarchy and non-uniformity into GPU memory systems. GPU programming models have chosen to expose this non-uniformity directly to the end user through scoped memory consistency models. As a result, there is room to improve upon earlier coherence protocols that were designed only for flat single-GPU hierarchies and/or simpler memory consistency models. In this paper, we propose HMG, a cache coherence protocol designed for forward-looking multi-GPU systems. HMG strikes a balance between simplicity and performance: it uses a readily-implementable VI-like protocol to track coherence states, but it tracks sharers using a hierarchical scheme optimized for mitigating the bandwidth limitations of inter-GPU links. HMG leverages the novel scoped, non-multi-copy-atomic properties of modern GPU memory models, and it avoids the overheads of invalidation acknowledgments and transient states that were needed to support prior GPU memory models. On a 4-GPU system, HMG improves performance over a software-controlled, bulk invalidation-based coherence mechanism by 26% and over a non-hierarchical hardware cache coherence protocol by 18%, thereby achieving 97% of the performance of an idealized caching system.","PeriodicalId":339648,"journal":{"name":"2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123256040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ACR: Amnesic Checkpointing and Recovery","authors":"Ismail Akturk, Ulya R. Karpuzcu","doi":"10.1109/HPCA47549.2020.00013","DOIUrl":"https://doi.org/10.1109/HPCA47549.2020.00013","url":null,"abstract":"Systematic checkpointing of the machine state makes restart of execution from a safe state possible upon detection of an error. The time and energy overhead of checkpointing, however, grows with the frequency of checkpointing. Considering the growth of expected error rates, amortizing this overhead becomes especially challenging, as checkpointing frequency tends to increase with increasing error rates. Based on the observation that due to imbalanced technology scaling, recomputing a data value can be more energy efficient than retrieving (i.e., loading) a stored copy, this paper explores how recomputation of data values (which otherwise would be read from a checkpoint from memory or secondary storage) can reduce the machine state to be checkpointed, and thereby, the checkpointing overhead. Even in a relatively small scale system, recomputation-based checkpointing can reduce the storage overhead by up to 23.91%; time overhead, by 11.92%; and energy overhead, by 12.53%, respectively.","PeriodicalId":339648,"journal":{"name":"2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126354202","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DRAIN: Deadlock Removal for Arbitrary Irregular Networks","authors":"Mayank Parasar, Hossein Farrokhbakht, Natalie D. Enright Jerger, Paul V. Gratz, T. Krishna, Joshua San Miguel","doi":"10.1109/HPCA47549.2020.00044","DOIUrl":"https://doi.org/10.1109/HPCA47549.2020.00044","url":null,"abstract":"Correctness is a first-order concern in the design of computer systems. For multiprocessors, a primary correctness concern is the deadlock-free operation of the network and its coherence protocol; furthermore, we must guarantee the continued correctness of the network in the face of increasing faults. Designing for deadlock freedom is expensive. Prior solutions either sacrifice performance or power efficiency to proactively avoid deadlocks or impose high hardware complexity to reactively resolve deadlocks as they occur. However, the precise confluence of events that lead to deadlocks is so rare that minimal resources and time should be spent to ensure deadlock freedom. To that end, we propose DRAIN, a subactive approach to remove potential deadlocks without needing to explicitly detect or avoid them. We simply let deadlocks happen and periodically drain (i.e., force the movement of) packets in the network that may be involved in a cyclic dependency. As deadlocks are a rare occurrence, draining can be performed infrequently and at low cost. Unlike prior solutions, DRAIN eliminates not only routing-level but also protocol-level deadlocks without the need for expensive virtual networks. DRAIN dramatically simplifies deadlock freedom for irregular topologies and networks that are prone to wear-related faults. Our evaluations show that on an average, DRAIN can save 26.73% packet latency compared to proactive deadlock-freedom schemes in the presence of faults while saving 77.6% power compared to reactive schemes.","PeriodicalId":339648,"journal":{"name":"2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131345217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}