Title: Scalable Data Generation for Evaluating Mixed-Precision Solvers
Authors: P. Luszczek, Y. Tsai, Neil Lindquist, H. Anzt, J. Dongarra
DOI: https://doi.org/10.1109/HPEC43674.2020.9286145
Venue: 2020 IEEE High Performance Extreme Computing Conference (HPEC), published 2020-09-22
Abstract: We present techniques for generating data for mixed-precision solvers that allow those solvers to be tested in a scalable manner. Our techniques target mixed-precision hardware and software, where both the solver and the hardware can take advantage of mixing multiple floating-point formats. This makes it possible to exploit the recent generation of hardware platforms that focus on ML and DNN workloads but can also be used for HPC applications, provided a new breed of algorithms is combined with custom floating-point formats to deliver performance beyond the standard IEEE data types while maintaining comparable accuracy.
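The class of solver this paper generates test data for, classical mixed-precision iterative refinement, can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' code; the test matrix, sizes, and iteration count are arbitrary assumptions. The factorization/solve runs in float32 while residuals are accumulated in float64, recovering double-precision accuracy:

```python
import numpy as np

def mixed_precision_solve(A, b, iters=5):
    """Solve Ax = b: solve in float32, refine the residual in float64."""
    A32 = A.astype(np.float32)
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                                  # residual in float64
        d = np.linalg.solve(A32, r.astype(np.float32)) # correction in float32
        x += d.astype(np.float64)
    return x

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned test matrix
b = rng.standard_normal(n)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(A @ x - b))  # residual near float64 round-off
```

The design point being tested is that the expensive O(n^3) work happens in the cheap, fast format, while a few cheap O(n^2) refinement steps restore full accuracy.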
Title: Studying the Effects of Hashing of Sparse Deep Neural Networks on Data and Model Parallelisms
Authors: M. Hasanzadeh-Mofrad, R. Melhem, Muhammad Yousuf Ahmad, Mohammad Hammoud
DOI: https://doi.org/10.1109/HPEC43674.2020.9286195
Venue: 2020 IEEE High Performance Extreme Computing Conference (HPEC), published 2020-09-22
Abstract: Deep Neural Network (DNN) training and inference are two resource-intensive tasks that are usually scaled out using data or model parallelism, where data parallelism parallelizes over the input data and model parallelism parallelizes over the network. Dense matrix-matrix multiplication is the key primitive behind training and inference of dense DNNs. Sparse DNNs, by contrast, are less resource-intensive than their dense counterparts while offering comparable accuracy, and they can likewise be parallelized using data or model parallelism, with Sparse Matrix-Matrix Multiplication (SpMM) as the key primitive. To scale out, both approaches initially use data parallelism to partition the input data among multiple machines. This initial partitioning makes the performance of both data and model parallelism prone to load imbalance, as partitions may be unevenly sized. In this paper, we take a deeper look into data and model parallelism and closely study the mechanics of the SpMM used for each. To remedy the load-imbalance problem, we incorporate hashing as a simple yet powerful method. Finally, we use the IEEE HPEC sparse DNN challenge dataset to evaluate the performance of data and model parallelism at scale. We scaled up to 32 machines (896 cores) and inferred a large sparse DNN with 4B parameters in 51 seconds. Results suggest that with hashing, data and model parallelism achieve super-linear speedup due to better load balance and cache utilization.
Title: How to Efficiently Train Your AI Agent? Characterizing and Evaluating Deep Reinforcement Learning on Heterogeneous Platforms
Authors: Yuan Meng, Yang Yang, S. Kuppannagari, R. Kannan, V. Prasanna
DOI: https://doi.org/10.1109/HPEC43674.2020.9286150
Venue: 2020 IEEE High Performance Extreme Computing Conference (HPEC), published 2020-09-22
Abstract: Deep Reinforcement Learning (Deep RL) is a key technology in several domains such as self-driving cars, robotics, and surveillance. In Deep RL, an agent uses a Deep Neural Network model to learn how to interact with the environment to achieve a certain goal. The efficiency of running a Deep RL algorithm on a hardware architecture depends on several factors: (1) the suitability of the architecture for the kernels and computation patterns fundamental to Deep RL; (2) the capability of the architecture's memory hierarchy to minimize data-communication latency; and (3) the ability of the architecture to hide the overheads introduced by the deeply nested, highly irregular computation characteristic of Deep RL algorithms. GPUs have been popular for accelerating RL algorithms; however, they fail to optimally satisfy the above requirements. A few recent works have developed highly customized accelerators for specific Deep RL algorithms, but these cannot easily be generalized to the plethora of available Deep RL algorithms and DNN model choices. In this paper, we explore the possibility of a unified framework that can accelerate a wide range of Deep RL algorithms, including variations in training methods and DNN model structures. We take one step toward this goal by defining a domain-specific high-level abstraction for a widely used class of Deep RL algorithms: on-policy Deep RL. Furthermore, we provide a systematic analysis of the performance of state-of-the-art on-policy Deep RL algorithms on CPU-GPU and CPU-FPGA platforms. We target two representative algorithms, PPO and A2C, for the application areas of robotics and games, and show that an FPGA-based custom accelerator achieves up to 24× (PPO) and 8× (A2C) speedups on training tasks, and 17× (PPO) and 2.1× (A2C) improvements in overall throughput.
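One computation shared by both on-policy algorithms named above (PPO and A2C) is the backward scan over a trajectory's rewards to produce discounted returns, from which advantages A_t = G_t - V(s_t) are estimated. A minimal sketch of that scan (illustrative, not the paper's kernel):

```python
def discounted_returns(rewards, gamma=0.99):
    """Backward scan computing G_t = r_t + gamma * G_{t+1}; both PPO and
    A2C need these returns before estimating advantages A_t = G_t - V(s_t)."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]

returns = discounted_returns([1.0, 0.0, 1.0], gamma=0.5)
print(returns)  # [1.25, 0.5, 1.0]
```

The loop-carried dependence on G is one example of the irregular, serial computation patterns the abstract says accelerators must hide alongside the DNN kernels.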
Title: Hardware Acceleration of Nonlocal Means-Based Speckle Noise Removal Applied to SAR Imagery
Authors: Hector A. Li Sanchez, A. George
DOI: https://doi.org/10.1109/HPEC43674.2020.9286196
Venue: 2020 IEEE High Performance Extreme Computing Conference (HPEC), published 2020-09-22
Abstract: Removal of speckle noise from synthetic aperture radar (SAR) imagery remains a challenging obstacle for onboard processing. Speckle noise is substantially harder to address than Gaussian noise due to its multiplicative nature. The probability patch-based (PPB) filter, based on the nonlocal means filter, can reduce speckle noise while preserving fine details; however, its high computational complexity inhibits its practical use in embedded applications. This is especially true for conventional space platforms, where radiation-hardened processors have significantly lower performance and energy efficiency than their commercial-off-the-shelf counterparts. Combined with ever-increasing data-processing demands and an emphasis on intelligent, autonomous systems, there is a need to enhance the computing capabilities of space platforms for present and future missions. Recently, hybrid system-on-chip (SoC) devices have been increasingly adopted in space applications. In particular, CPU+FPGA devices present several opportunities for efficient acceleration of many applications compared to software-only architectures. In this paper, we present a detailed description of a CPU+FPGA accelerator for speckle-noise removal implementing the PPB filter. The proposed architecture leverages the strengths of CPUs and FPGAs to maximize performance; studying the dataflow and computation properties of the algorithm allows for a highly parallelized and fully pipelined design. When evaluated on the Xilinx Z-7045 SoC, our architecture shows a significant execution-time improvement (up to ~750×) over a software-only baseline while maintaining modest FPGA resource utilization. To verify its function, filtering quality is evaluated using images artificially corrupted by simulated speckle noise as well as real SAR images. Quantitative analysis shows that the hardware design introduces only negligible quality loss.
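The abstract's point that speckle is multiplicative, and therefore harder than additive Gaussian noise, can be seen in a small simulation. This is a generic illustration (gamma-distributed intensity speckle with L looks is one common model), not the paper's corruption procedure:

```python
import numpy as np

rng = np.random.default_rng(1)
clean = np.full((64, 64), 100.0)  # constant-intensity test image

# Multiplicative speckle: each pixel is scaled by unit-mean gamma noise
# (L = number of looks). Unlike additive Gaussian noise, the noise
# amplitude grows with the signal, so bright regions are noisier.
L = 4
speckle = rng.gamma(shape=L, scale=1.0 / L, size=clean.shape)
noisy = clean * speckle

print(noisy.mean())  # ~100: mean intensity preserved
print(noisy.std())   # ~100/sqrt(L): noise proportional to signal level
```

Because the noise scales with intensity, a filter must compare patches by their ratio statistics rather than their differences, which is what PPB-style filters do.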
Title: A Communication-Efficient Multi-Chip Design for Range-Limited Molecular Dynamics
Authors: Chunshu Wu, Tong Geng, Chen Yang, Vipin Sachdeva, W. Sherman, Martin C. Herbordt
DOI: https://doi.org/10.1109/HPEC43674.2020.9286146
Venue: 2020 IEEE High Performance Extreme Computing Conference (HPEC), published 2020-09-22
Abstract: Molecular Dynamics simulation (MD) has long been considered a promising FPGA application, especially with clusters of tightly coupled FPGAs, where large-scale, general-purpose, low-latency interconnects provide a communication capability not available with any other COTS computing technology. Parallelization of one part of the MD computation, the 3D FFT, has been studied previously; for likely FPGA cluster sizes, however, the range-limited computation (RL) is more challenging. The motivation here is that direct replication of the single-chip design suffers from inefficient inter-board bandwidth usage. In particular, although communication in RL is local, bandwidth limitations will likely constrain performance unless great care is taken in design and analysis. In the multi-chip scenario, inter-board bandwidth is the critical constraint and the main target of this work. We analyze it with respect to three application restructurings: workload distribution, data forwarding pattern, and data locality. We describe how bandwidth can be balanced by configuring workload distribution and data forwarding paths with respect to the number of onboard transceiver ports. We also show that, by manipulating data locality, the multi-chip design can be efficiently migrated from the single-chip design, and that the total bandwidth required can be configured to satisfy the bandwidth limit.
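Why range-limited communication is local yet still bandwidth-hungry can be seen from a back-of-the-envelope halo calculation (a simplification for intuition, not the paper's model). With a cutoff radius c, each board owning a cubic region of side s must import all particle data within c of its faces:

```python
def halo_fraction(box_side, cutoff):
    """Fraction of a board's region volume that must be received from
    neighbors for a range-limited (cutoff) force computation: the shell
    of thickness `cutoff` around the owned cube. Requires box_side > 2*cutoff."""
    interior = (box_side - 2 * cutoff) ** 3
    shell = box_side ** 3 - interior
    return shell / box_side ** 3

# As regions shrink (more boards), the surface shell dominates the volume.
for side in (20, 10, 5):
    print(side, round(halo_fraction(side, cutoff=1), 3))
```

The fraction grows as regions shrink, which is why scaling out RL across boards stresses inter-board bandwidth even though each exchange is only with nearest neighbors.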
Title: CQNN: a CGRA-based QNN Framework
Authors: Tong Geng, Chunshu Wu, Cheng Tan, B. Fang, Ang Li, Martin C. Herbordt
DOI: https://doi.org/10.1109/HPEC43674.2020.9286194
Venue: 2020 IEEE High Performance Extreme Computing Conference (HPEC), published 2020-09-22
Abstract: Quantized Neural Networks (QNNs) have drawn tremendous attention since, compared with Convolutional Neural Networks (CNNs), they often dramatically reduce computation, communication, and storage demands with negligible loss in accuracy. To find an optimal balance between performance and accuracy, developers use different data-widths for different layers and channels. Given this large parameter space, it is challenging to design a QNN accelerator that is generally efficient across varied and flexible model configurations. In this paper we propose CQNN, a novel Coarse-Grained Reconfigurable Architecture (CGRA)-based QNN acceleration framework. CQNN has a large number of basic components for binary functions. By programming CQNN at runtime according to the target QNN model, these basic components are integrated to support QNN functions with any data-width and hyperparameter requirements. The result is an optimal QNN accelerator for the target model. The framework includes a compiler, hardware design, simulator, and RTL generator. Experimental results show CQNN can complete inference of AlexNet and VGG-16 within 0.13 ms and 2.63 ms, respectively. We demonstrate the design on an FPGA platform; however, this is only to showcase the method: the approach does not rely on any FPGA-specific features and can thus be implemented as an ASIC as well.
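The per-layer/per-channel data-width knob the abstract describes can be made concrete with a uniform symmetric quantizer, the simplest scheme of the kind QNN accelerators must support (an illustrative sketch, not CQNN's quantizer):

```python
import numpy as np

def quantize(w, bits):
    """Uniform symmetric quantization to the given bit-width: map weights
    onto 2^(bits-1)-1 signed integer levels, then dequantize for comparison."""
    levels = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / levels
    codes = np.round(w / scale).astype(np.int32)  # integer codes the hardware stores
    return codes * scale                          # dequantized approximation

w = np.array([-0.9, -0.2, 0.05, 0.4, 0.9])
for bits in (8, 4, 2):
    err = np.abs(quantize(w, bits) - w).max()
    print(bits, err)  # reconstruction error grows as bit-width shrinks
```

The accuracy/performance trade-off in the abstract is exactly this curve: fewer bits mean cheaper multipliers and less storage, at the cost of larger reconstruction error on the layers that can least tolerate it.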
Title: Efficient Sparse Matrix-Vector Multiplication on Intel PIUMA Architecture
Authors: S. Aananthakrishnan, R. Pawlowski, J. Fryman, I. Hur
DOI: https://doi.org/10.1109/HPEC43674.2020.9286245
Venue: 2020 IEEE High Performance Extreme Computing Conference (HPEC), published 2020-09-22
Abstract: Intel PIUMA is a novel architecture tailored for graph analytics. SpMV is a core component of graph analytics, and we report on an early performance study of SpMV on the Intel PIUMA architecture.
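For readers unfamiliar with the kernel being studied, SpMV over the standard CSR (compressed sparse row) layout looks like the following (a generic reference sketch, unrelated to the PIUMA implementation):

```python
def spmv_csr(indptr, indices, data, x):
    """y = A @ x for a matrix in CSR form: for each row, accumulate
    products over only that row's stored nonzeros. The gathers through
    `indices` are the irregular memory accesses graph architectures target."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y

# A = [[2, 0, 1],
#      [0, 3, 0]]
indptr, indices, data = [0, 2, 3], [0, 2, 1], [2.0, 1.0, 3.0]
print(spmv_csr(indptr, indices, data, [1.0, 1.0, 1.0]))  # [3.0, 3.0]
```

The indirect loads `x[indices[k]]` have little locality on graph-shaped matrices, which is why SpMV is a natural stress test for an architecture built for graph analytics.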
Title: iBench: a Distributed Inference Simulation and Benchmark Suite
Authors: W. Brewer, G. Behm, A. Scheinine, Ben Parsons, Wesley Emeneker, Robert P. Trevino
DOI: https://doi.org/10.1109/HPEC43674.2020.9286169
Venue: 2020 IEEE High Performance Extreme Computing Conference (HPEC), published 2020-09-22
Abstract: We present a novel distributed inference benchmarking system, called "iBench", that provides relevant performance metrics for high-performance edge computing systems using trained deep learning models. The proposed benchmark is unique in that it includes data-transfer performance through a distributed system, such as a supercomputer, using clients and servers to provide a system-level benchmark. iBench is flexible and robust enough to allow benchmarking of custom-built inference servers, which we demonstrate by developing a custom Flask-based inference server to serve MLPerf's official ResNet50v1.5 model. In this paper, we compare iBench against MLPerf inference performance on an 8-V100 GPU node. iBench is shown to provide two primary advantages over MLPerf: (1) the ability to measure distributed inference performance, and (2) a more realistic measure of performance for inference servers on HPC systems, by accounting for factors beyond inference time, such as HTTP request-response time, payload pre-processing and packing time, and ingest time.
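The distinction iBench draws between end-to-end and inference-only time can be sketched with a toy timing harness. The stage functions here are hypothetical stand-ins, not iBench's API; only the measurement pattern is the point:

```python
import time

def timed(fn, *args):
    """Run fn(*args), returning (result, elapsed seconds)."""
    t0 = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - t0

# Hypothetical stand-ins for the stages a system-level benchmark separates:
# payload preparation, network transport, and the model forward pass itself.
def preprocess(x): return [float(v) for v in x]
def transport(x): time.sleep(0.005); return x  # simulated network hop
def infer(x): return sum(x)

payload = list(range(1000))
_, t_pre = timed(preprocess, payload)
_, t_net = timed(transport, payload)
_, t_inf = timed(infer, payload)
print(f"end-to-end {t_pre + t_net + t_inf:.4f}s vs inference-only {t_inf:.4f}s")
```

A device-level benchmark reports only the last number; a system-level one reports the sum, which is what the client actually experiences.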
Title: Architectural Analysis of Deep Learning on Edge Accelerators
Authors: Luke Kljucaric, A. Johnson, A. George
DOI: https://doi.org/10.1109/HPEC43674.2020.9286209
Venue: 2020 IEEE High Performance Extreme Computing Conference (HPEC), published 2020-09-22
Abstract: As computer architectures continue to integrate application-specific hardware, it is critical to understand the relative performance of devices for maximum application acceleration. The goal of benchmarking suites such as MLPerf, which analyzes machine-learning (ML) hardware performance, is to standardize fair comparisons of different hardware architectures. However, many applications are not well represented by these standards and require different workloads, such as other ML models and datasets, to achieve similar goals. Additionally, many devices feature hardware optimized for data types other than 32-bit floating point, the standard representation defined by MLPerf. Edge-computing devices often feature application-specific hardware to offload common ML operations from the constrained CPU. This research analyzes multiple low-power compute architectures featuring ML-specific hardware on a case study of handwritten Chinese character recognition. Specifically, AlexNet and a custom version of GoogLeNet are benchmarked in terms of streaming latency for optical character recognition. Because these models are custom and not the most widely used, many architectures are not specifically optimized for them; their performance can therefore stress devices in different yet insightful ways, from which generalizations about the performance of other models can be drawn. The NVIDIA Jetson AGX Xavier (AGX), Intel Neural Compute Stick 2 (NCS2), and Google Edge TPU architectures are analyzed with respect to their performance. The AGX and TPU devices showed the lowest streaming latency for AlexNet and GoogLeNet, respectively. Additionally, the tightly integrated NCS2 design showed the best generalizability in performance and efficiency across neural networks.
Title: Evaluating Cryptographic Performance of Raspberry Pi Clusters
Authors: Daniel Hawthorne-Madell, Michael P. Kapralos, R. Blaine, Suzanne J. Matthews
DOI: https://doi.org/10.1109/HPEC43674.2020.9286247
Venue: 2020 IEEE High Performance Extreme Computing Conference (HPEC), published 2020-09-22
Abstract: ARM-based single-board computers (SBCs) such as the Raspberry Pi capture the imaginations of hobbyists and scientists due to their low cost and versatility. With the deluge of data produced in edge environments, SBCs and SBC clusters have emerged as low-cost platforms for data collection and analysis. Simultaneously, security is a growing concern as new regulations require secure communication for data collected at the edge. In this paper, we compare the performance of a Raspberry Pi cluster to a power-efficient next unit of computing (NUC) and a midrange desktop (MRD) on three leading cryptographic algorithms (AES, Twofish, and Serpent), and assess the general-purpose performance of the three systems using the HPL benchmark. Our results suggest that hardware-level instruction sets for all three cryptographic algorithms should be implemented on single-board computers to aid secure data transfer at the edge.
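A throughput comparison like the one in this paper boils down to a simple harness: encrypt a fixed buffer repeatedly and divide bytes by elapsed time. The sketch below uses a toy XOR cipher purely as a stand-in (Python's standard library has no AES; the paper benchmarks real AES, Twofish, and Serpent implementations), so only the measurement pattern is meaningful:

```python
import time

def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Toy stand-in cipher for harness demonstration; not secure."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def throughput_mb_s(fn, data, key, reps=5):
    """Measure encryption throughput in MB/s over `reps` passes."""
    t0 = time.perf_counter()
    for _ in range(reps):
        fn(data, key)
    dt = time.perf_counter() - t0
    return len(data) * reps / dt / 1e6

data = bytes(range(256)) * 1024      # 256 KiB test buffer
key = b"sixteen-byte-key"            # 128-bit key, like AES-128
print(f"{throughput_mb_s(xor_cipher, data, key):.1f} MB/s")
```

Running the same harness with each real cipher on the Pi, NUC, and MRD yields the cross-system comparison; the gap between software ciphers and hardware instruction sets (e.g. ARMv8 AES extensions) is what motivates the paper's recommendation.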