SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training
Eric Qin, A. Samajdar, Hyoukjun Kwon, V. Nadella, S. Srinivasan, Dipankar Das, Bharat Kaul, T. Krishna
2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), February 2020. DOI: 10.1109/HPCA47549.2020.00015

Abstract: The advent of Deep Learning (DL) has radically transformed the computing industry across the entire spectrum from algorithms to circuits. As myriad application domains embrace DL, it has become synonymous with a genre of workloads spanning vision, speech, language, recommendation, robotics, and games. The key compute kernel within most DL workloads is the general matrix-matrix multiplication (GEMM), which appears frequently in both the forward pass (inference and training) and the backward pass (training). GEMMs are a natural choice for hardware acceleration to speed up training, and have led to 2D systolic architectures such as NVIDIA Tensor Cores and the Google Tensor Processing Unit (TPU). Unfortunately, emerging GEMMs in DL are highly irregular and sparse, which leads to poor data mappings on systolic architectures. This paper proposes SIGMA, a flexible and scalable architecture that offers high utilization of all its processing elements (PEs) regardless of kernel shape and sparsity. SIGMA includes a novel reduction tree microarchitecture named Forwarding Adder Network (FAN). SIGMA performs 5.7x better than systolic array architectures for irregular sparse matrices, and roughly 3x better than state-of-the-art sparse accelerators. We demonstrate an instance of SIGMA operating at 10.8 TFLOPS efficiency across arbitrary levels of sparsity, with a 65.10 mm^2 and 22.33 W footprint on a 28 nm process.
{"title":"[Copyright notice]","authors":"","doi":"10.1109/hpca47549.2020.00003","DOIUrl":"https://doi.org/10.1109/hpca47549.2020.00003","url":null,"abstract":"","PeriodicalId":339648,"journal":{"name":"2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126985024","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Baldur: A Power-Efficient and Scalable Network Using All-Optical Switches
M. Jokar, Junyi Qiu, F. Chong, L. Goddard, J. Dallesasse, M. Feng, Yanjing Li
2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), February 2020. DOI: 10.1109/HPCA47549.2020.00022

Abstract: We present the first all-optical network, Baldur, to enable power-efficient and high-speed communication in future exascale computing systems. The essence of Baldur is its ability to perform packet routing on the fly in the optical domain using an emerging technology called the transistor laser (TL), which presents interesting opportunities and challenges at the system level. Optical packet switching readily eliminates many inefficiencies associated with crossings between the optical and electrical domains. However, TL gates consume high power at the current technology node, which makes TL-based buffering and optical clock recovery impractical. Consequently, we must adopt novel (bufferless and clock-less) architecture and design approaches that are substantially different from those used in current networks. At the architecture level, we support a bufferless design by turning to techniques that have fallen out of favor for current networks: Baldur uses a low-radix, multi-stage network with a simple routing algorithm that drops packets to handle congestion, and we further incorporate path multiplicity and randomness to minimize packet drops. This design also minimizes the number of TL gates needed in each switch. At the logic design level, a non-conventional, length-based data encoding scheme is used to eliminate the need for clock recovery. We thoroughly validate and evaluate Baldur using a circuit simulator and a network simulator. Our results show that Baldur achieves up to 3,000x lower average latency while consuming 3.2x-26.4x less power than various state-of-the-art networks under a wide variety of traffic patterns and real workloads, at the scale of 1,024 server nodes. Baldur is also highly scalable, since its power per node stays relatively constant as we increase the network size to over 1 million server nodes, which corresponds to 14.6x-31.0x power improvements compared to state-of-the-art networks at this scale.
Improving Predication Efficiency through Compaction/Restoration of SIMD Instructions
Adrián Barredo, J. M. Cebrian, Miquel Moretó, Marc Casas, M. Valero
2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), February 2020. DOI: 10.1109/HPCA47549.2020.00064

Abstract: Vector processors offer a wide range of unexplored opportunities to improve performance and energy efficiency. Despite this potential, vector code generation and execution face significant challenges, the most relevant being control-flow divergence. Most modern processors with SIMD extensions (such as AVX) rely on predication to handle divergence. In predicated code, performance and energy consumption are usually insensitive to the number of true values in a predicate mask, which means that system efficiency becomes sub-optimal as vector length increases. In this paper we focus on SIMD extensions and propose a novel approach to improve the execution efficiency of predicated SIMD instructions: the Compaction/Restoration (CR) technique. CR delays predicated SIMD instructions with inactive elements and compacts them with instances of the same instruction from different loop iterations to form an equivalent dense vector instruction in which, in the best case, all elements are active. After executing such dense instructions, their results are restored to the original instructions. Our evaluation shows that CR improves performance by up to 25% and reduces dynamic energy consumption by up to 43% on real, unmodified applications with predicated execution. Moreover, CR allows executing unmodified legacy code with short vector instructions (AVX2) on newer architectures with wider vectors (AVX-512), achieving up to 56% performance benefits.
Hybrid2: Combining Caching and Migration in Hybrid Memory Systems
E. Vasilakis, Vassilis D. Papaefstathiou, P. Trancoso, I. Sourdis
2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), February 2020. DOI: 10.1109/HPCA47549.2020.00059

Abstract: This paper considers a hybrid memory system composed of memory technologies with different characteristics: in particular, a small near memory exhibiting high bandwidth, i.e., 3D-stacked DRAM, and a larger far memory offering capacity at lower bandwidth, i.e., off-chip DRAM. In the past, the near memory of such a system has been used either as a DRAM cache or as part of a flat address space combined with a migration mechanism. Caches and migration offer different tradeoffs (between performance, main memory capacity, data transfer costs, etc.) and share similar challenges related to data-transfer granularity and metadata management. This paper proposes Hybrid2, a new hybrid memory system architecture that combines a DRAM cache with a migration scheme. Hybrid2 does not deny valuable capacity from the memory system because it uses only a small fraction of the near memory as a DRAM cache (64 MB in our experiments). It further leverages the DRAM cache as a staging area to select the data most suitable for migration. Finally, Hybrid2 alleviates the metadata overheads of both DRAM caches and migration using a common mechanism. Using near-to-far memory ratios of 1:16, 1:8 and 1:4 in our experiments, Hybrid2 on average outperforms current state-of-the-art migration schemes by 7.9%, 9.1% and 6.4%, respectively. In the same system configurations, compared to DRAM caches, Hybrid2 gives away on average only 0.3%, 1.2%, and 5.3% of performance while offering 5.9%, 12.1%, and 24.6% more main memory capacity, respectively.
EFLOPS: Algorithm and System Co-Design for a High Performance Distributed Training Platform
Jianbo Dong, Zheng Cao, Tao Zhang, Jianxi Ye, Shaochuan Wang, Fei Feng, Li Zhao, Xiaoyong Liu, Liuyihan Song, Liwei Peng, Yiqun Guo, Xiaowei Jiang, Lingbo Tang, Yin Du, Yingya Zhang, Pan Pan, Yuan Xie
2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), February 2020. DOI: 10.1109/HPCA47549.2020.00056

Abstract: Deep neural networks (DNNs) have gained tremendous attention as compelling solutions for applications such as image classification, object detection, and speech recognition. Their success rests on extensive training to ensure that model accuracy is good enough for those applications. Training a DNN model is challenging today because 1) model and data sizes keep increasing, which usually requires more iterations to train, and 2) DNN algorithms evolve rapidly, which requires the training phase to be short for quick deployment. To address these challenges, distributed training platforms have been proposed to leverage massive numbers of server nodes, with the hope of significantly reducing training time. Scalability is therefore a critical metric for evaluating a distributed training platform. Nevertheless, our analysis reveals that traditional server clusters scale poorly for training due to traffic congestion within the server and beyond. Intra-server traffic on the I/O fabric can cause severe congestion and skewed quality of service as high-performance devices compete with each other. Moreover, traffic congestion on the Ethernet used for inter-server communication can also incur significant performance degradation. In this work, we devise a novel distributed training platform, EFLOPS, that adopts an algorithm and system co-design methodology to achieve good scalability. A new server architecture is proposed to alleviate intra-server congestion. Moreover, a new network topology, BiGraph, is proposed to divide the network into two separate parts, so that there is always a direct connection between any two nodes from different parts. Finally, together with BiGraph, a topology-aware allreduce algorithm is proposed to eliminate traffic congestion on the direct connections. The experimental results show that eliminating congestion at the network interface yields up to 11.3x communication speedup. The proposed algorithm and topology provide further improvement of up to 6.08x. The overall performance of ResNet-50 training achieves near-linear scalability and is competitive with the top rankings of MLPerf results.
SnackNoC: Processing in the Communication Layer
K. Sangaiah, Michael Lui, Ragh Kuttappa, B. Taskin, Mark Hempstead
2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), February 2020. DOI: 10.1109/HPCA47549.2020.00045

Abstract: In this work, we propose and evaluate a Network-on-Chip (NoC) augmented with lightweight processing elements to provide a lean dataflow-style system. We show that contemporary NoC routers frequently experience long periods of idle time, with less than 10% link utilization in HPC applications. By repurposing the temporal and spatial slack of the NoC, the proposed platform, SnackNoC, is able to compute linear algebra kernels efficiently within the communication layer at minimal additional resource cost. SnackNoC 'Snack' application kernels are programmed with a producer-consumer data model that uses the NoC slack to store and transmit intermediate data between processing elements. SnackNoC is demonstrated in a multi-program environment that continually executes linear algebra kernels on the NoC simultaneously with chip multiprocessor (CMP) applications on the processor cores. Linear algebra kernels are computed up to 14.2x faster on SnackNoC than on an Intel Haswell-EP x86 processing core. The cost of executing 'Snack' kernels in parallel with the CMP applications is a minimal runtime impact of 0.01% to 0.83% due to higher link utilization, and an uncore area overhead of 1.1%.
NVDIMM-C: A Byte-Addressable Non-Volatile Memory Module for Compatibility with Standard DDR Memory Interfaces
Changmin Lee, Wonjae Shin, D. Kim, Yong-Ho Yu, Sung-Joon Kim, Taekyeong Ko, Deokho Seo, Jongmin Park, Kwanghee Lee, Seon-Jun Choi, Namhyung Kim, G. Vishak, A. George, V. Vishwas, Donghun Lee, Kang-Woo Choi, Chang-In Song, Dohan Kim, Insu Choi, I. Jung, Y. Song, Jinman Han
2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), February 2020. DOI: 10.1109/HPCA47549.2020.00048

Abstract: Currently, there are two representative non-volatile dual in-line memory module (NVDIMM) interfaces: the proprietary Intel DDR-T and the JEDEC NVDIMM-P, neither of which is supported by existing platforms. Adopting a new platform is costly, and measuring the efficiency of migrating to it is even more complex. This study takes an alternative path: designing a new memory device that can be supported by all existing systems. In this paper, we propose an NVDIMM architecture with several system-wide mechanisms that allow the synchronous DDR4 memory interface to support non-deterministic (asynchronous) timing. The proposed memory architecture is implemented as a real device prototype and evaluated using synthetic and real workloads on an x86-64 server system.
Griffin: Hardware-Software Support for Efficient Page Migration in Multi-GPU Systems
Trinayan Baruah, Yifan Sun, Ali Tolga Dinçer, Saiful A. Mojumder, José L. Abellán, Yash Ukidave, A. Joshi, Norman Rubin, John Kim, D. Kaeli
2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), February 2020. DOI: 10.1109/HPCA47549.2020.00055

Abstract: As transistor scaling becomes increasingly difficult, scaling the core count on a single GPU chip has also become extremely challenging. As the volume of data to process in today's increasingly parallel workloads continues to grow unbounded, we need scalable solutions that can keep up with this increasing demand. To meet the needs of modern parallel applications, multi-GPU systems offer a promising path to deliver high performance and large memory capacity. However, multi-GPU systems suffer from performance issues associated with GPU-to-GPU communication and data sharing, which severely limit their benefits. Programming multi-GPU systems has been made considerably simpler with the advent of Unified Memory, which enables runtime migration of pages to a GPU on demand. Current multi-GPU systems rely on a first-touch demand-paging scheme, where memory pages are migrated from the CPU to the GPU on the first GPU access to a page. The data-sharing nature of GPU applications makes deploying an efficient, programmer-transparent mechanism for inter-GPU page migration challenging. Therefore, following the initial CPU-to-GPU page migration, the page is pinned on that GPU. Future accesses to this page from other GPUs happen at cache-line granularity; pages are not transferred between GPUs without significant programmer intervention. We observe that this mechanism suffers from two major drawbacks: 1) imbalance in the page distribution across multiple GPUs, and 2) inability to move a page to the GPU that uses it most frequently. Both of these problems lead to load imbalance across GPUs, degrading the performance of the multi-GPU system. To address these problems, we propose Griffin, a holistic hardware-software solution to improve the performance of NUMA multi-GPU systems. Griffin introduces programmer-transparent modifications to both the IOMMU and the GPU architecture, supporting efficient runtime page migration based on locality information. In particular, Griffin employs a novel mechanism to detect and move pages at runtime between GPUs, increasing the frequency with which accesses are resolved locally, which in turn improves performance. To ensure better load balancing across GPUs, Griffin employs a Delayed First-Touch Migration policy that ensures pages are evenly distributed across multiple GPUs. Our results on a diverse set of multi-GPU workloads show that Griffin can achieve up to a 2.9× speedup on a multi-GPU system, while incurring low implementation overhead.
{"title":"CLITE: Efficient and QoS-Aware Co-Location of Multiple Latency-Critical Jobs for Warehouse Scale Computers","authors":"Tirthak Patel, Devesh Tiwari","doi":"10.1109/HPCA47549.2020.00025","DOIUrl":"https://doi.org/10.1109/HPCA47549.2020.00025","url":null,"abstract":"Large-scale data centers run latency-critical jobs with quality-of-service (QoS) requirements, and throughput-oriented background jobs, which need to achieve high perfor-mance. Previous works have proposed methods which cannot co-locate multiple latency-critical jobs with multiple back-grounds jobs while: (1) meeting the QoS requirements of all latency-critical jobs, and (2) maximizing the performance of the background jobs. This paper proposes CLITE, a Bayesian Optimization-based, multi-resource partitioning technique which achieves these goals. CLITE is publicly available at https://github.com/GoodwillComputingLab/CLITE.","PeriodicalId":339648,"journal":{"name":"2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127249899","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}