Title: Scalable Data Generation for Evaluating Mixed-Precision Solvers
Authors: P. Luszczek, Y. Tsai, Neil Lindquist, H. Anzt, J. Dongarra
DOI: https://doi.org/10.1109/HPEC43674.2020.9286145
Venue: 2020 IEEE High Performance Extreme Computing Conference (HPEC), published 2020-09-22
Abstract: We present techniques for generating data for mixed-precision solvers that allow those solvers to be tested in a scalable manner. Our techniques target mixed-precision hardware and software, where both the solver and the hardware can take advantage of mixing multiple floating-point formats. This makes it possible to exploit the recent generation of hardware platforms that focus on ML and DNN workloads but can also be used for HPC applications, provided a new breed of algorithms is combined with custom floating-point formats to deliver performance beyond the standard IEEE data types while maintaining comparable accuracy.
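The class of solver this paper generates test data for, classical mixed-precision iterative refinement, can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' code; the test matrix, sizes, and iteration count are arbitrary assumptions. The factorization/solve runs in float32 while residuals are accumulated in float64, recovering double-precision accuracy:

```python
import numpy as np

def mixed_precision_solve(A, b, iters=5):
    """Solve Ax = b: solve in float32, refine the residual in float64."""
    A32 = A.astype(np.float32)
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                                  # residual in float64
        d = np.linalg.solve(A32, r.astype(np.float32)) # correction in float32
        x += d.astype(np.float64)
    return x

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned test matrix
b = rng.standard_normal(n)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(A @ x - b))  # residual near float64 round-off
```

The design point being tested is that the expensive O(n^3) work happens in the cheap, fast format, while a few cheap O(n^2) refinement steps restore full accuracy.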
Title: Studying the Effects of Hashing of Sparse Deep Neural Networks on Data and Model Parallelisms
Authors: M. Hasanzadeh-Mofrad, R. Melhem, Muhammad Yousuf Ahmad, Mohammad Hammoud
DOI: https://doi.org/10.1109/HPEC43674.2020.9286195
Venue: 2020 IEEE High Performance Extreme Computing Conference (HPEC), published 2020-09-22
Abstract: Deep Neural Network (DNN) training and inference are two resource-intensive tasks that are usually scaled out using data or model parallelism, where data parallelism parallelizes over the input data and model parallelism parallelizes over the network. Dense matrix-matrix multiplication is the key primitive behind training and inference of dense DNNs. Sparse DNNs, by contrast, are less resource-intensive than their dense counterparts while offering comparable accuracy, and they can likewise be parallelized using data or model parallelism, with Sparse Matrix-Matrix Multiplication (SpMM) as the key primitive. To scale out, both approaches initially use data parallelism to partition the input data among multiple machines. This initial partitioning makes the performance of both data and model parallelism prone to load imbalance, as partitions may be unevenly sized. In this paper, we take a deeper look into data and model parallelism and closely study the mechanics of the SpMM used for each. To remedy the load-imbalance problem, we incorporate hashing as a simple yet powerful method. Finally, we use the IEEE HPEC sparse DNN challenge dataset to evaluate the performance of data and model parallelism at scale. We scaled up to 32 machines (896 cores) and inferred a large sparse DNN with 4B parameters in 51 seconds. Results suggest that with hashing, data and model parallelism achieve super-linear speedup due to better load balance and cache utilization.
Title: How to Efficiently Train Your AI Agent? Characterizing and Evaluating Deep Reinforcement Learning on Heterogeneous Platforms
Authors: Yuan Meng, Yang Yang, S. Kuppannagari, R. Kannan, V. Prasanna
DOI: https://doi.org/10.1109/HPEC43674.2020.9286150
Venue: 2020 IEEE High Performance Extreme Computing Conference (HPEC), published 2020-09-22
Abstract: Deep Reinforcement Learning (Deep RL) is a key technology in several domains such as self-driving cars, robotics, and surveillance. In Deep RL, an agent uses a Deep Neural Network model to learn how to interact with the environment to achieve a certain goal. The efficiency of running a Deep RL algorithm on a hardware architecture depends on several factors: (1) the suitability of the architecture for the kernels and computation patterns fundamental to Deep RL; (2) the capability of the architecture's memory hierarchy to minimize data-communication latency; and (3) the ability of the architecture to hide the overheads introduced by the deeply nested, highly irregular computation characteristic of Deep RL algorithms. GPUs have been popular for accelerating RL algorithms; however, they fail to optimally satisfy the above requirements. A few recent works have developed highly customized accelerators for specific Deep RL algorithms, but these cannot easily be generalized to the plethora of available Deep RL algorithms and DNN model choices. In this paper, we explore the possibility of a unified framework that can accelerate a wide range of Deep RL algorithms, including variations in training methods and DNN model structures. We take one step toward this goal by defining a domain-specific high-level abstraction for a widely used class of Deep RL algorithms: on-policy Deep RL. Furthermore, we provide a systematic analysis of the performance of state-of-the-art on-policy Deep RL algorithms on CPU-GPU and CPU-FPGA platforms. We target two representative algorithms, PPO and A2C, for the application areas of robotics and games, and show that an FPGA-based custom accelerator achieves up to 24× (PPO) and 8× (A2C) speedups on training tasks, and 17× (PPO) and 2.1× (A2C) improvements in overall throughput.
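One computation shared by both on-policy algorithms named above (PPO and A2C) is the backward scan over a trajectory's rewards to produce discounted returns, from which advantages A_t = G_t - V(s_t) are estimated. A minimal sketch of that scan (illustrative, not the paper's kernel):

```python
def discounted_returns(rewards, gamma=0.99):
    """Backward scan computing G_t = r_t + gamma * G_{t+1}; both PPO and
    A2C need these returns before estimating advantages A_t = G_t - V(s_t)."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]

returns = discounted_returns([1.0, 0.0, 1.0], gamma=0.5)
print(returns)  # [1.25, 0.5, 1.0]
```

The loop-carried dependence on G is one example of the irregular, serial computation patterns the abstract says accelerators must hide alongside the DNN kernels.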
Title: Hardware Acceleration of Nonlocal Means-Based Speckle Noise Removal Applied to SAR Imagery
Authors: Hector A. Li Sanchez, A. George
DOI: https://doi.org/10.1109/HPEC43674.2020.9286196
Venue: 2020 IEEE High Performance Extreme Computing Conference (HPEC), published 2020-09-22
Abstract: Removal of speckle noise from synthetic aperture radar (SAR) imagery remains a challenging obstacle for onboard processing. Speckle noise is substantially harder to address than Gaussian noise due to its multiplicative nature. The probability patch-based (PPB) filter, based on the nonlocal means filter, can reduce speckle noise while preserving fine details; however, its high computational complexity inhibits its practical use in embedded applications. This is especially true for conventional space platforms, where radiation-hardened processors have significantly lower performance and energy efficiency than their commercial-off-the-shelf counterparts. Combined with ever-increasing data-processing demands and an emphasis on intelligent, autonomous systems, there is a need to enhance the computing capabilities of space platforms for present and future missions. Recently, hybrid system-on-chip (SoC) devices have been increasingly adopted in space applications. In particular, CPU+FPGA devices present several opportunities for efficient acceleration of many applications compared to software-only architectures. In this paper, we present a detailed description of a CPU+FPGA accelerator for speckle-noise removal implementing the PPB filter. The proposed architecture leverages the strengths of CPUs and FPGAs to maximize performance; studying the dataflow and computation properties of the algorithm allows for a highly parallelized and fully pipelined design. When evaluated on the Xilinx Z-7045 SoC, our architecture shows a significant execution-time improvement (up to ~750×) over a software-only baseline while maintaining modest FPGA resource utilization. To verify its function, filtering quality is evaluated using images artificially corrupted by simulated speckle noise as well as real SAR images. Quantitative analysis shows that the hardware design introduces only negligible quality loss.
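The abstract's point that speckle is multiplicative, and therefore harder than additive Gaussian noise, can be seen in a small simulation. This is a generic illustration (gamma-distributed intensity speckle with L looks is one common model), not the paper's corruption procedure:

```python
import numpy as np

rng = np.random.default_rng(1)
clean = np.full((64, 64), 100.0)  # constant-intensity test image

# Multiplicative speckle: each pixel is scaled by unit-mean gamma noise
# (L = number of looks). Unlike additive Gaussian noise, the noise
# amplitude grows with the signal, so bright regions are noisier.
L = 4
speckle = rng.gamma(shape=L, scale=1.0 / L, size=clean.shape)
noisy = clean * speckle

print(noisy.mean())  # ~100: mean intensity preserved
print(noisy.std())   # ~100/sqrt(L): noise proportional to signal level
```

Because the noise scales with intensity, a filter must compare patches by their ratio statistics rather than their differences, which is what PPB-style filters do.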
Title: A Communication-Efficient Multi-Chip Design for Range-Limited Molecular Dynamics
Authors: Chunshu Wu, Tong Geng, Chen Yang, Vipin Sachdeva, W. Sherman, Martin C. Herbordt
DOI: https://doi.org/10.1109/HPEC43674.2020.9286146
Venue: 2020 IEEE High Performance Extreme Computing Conference (HPEC), published 2020-09-22
Abstract: Molecular Dynamics simulation (MD) has long been considered a promising FPGA application, especially with clusters of tightly coupled FPGAs, where large-scale, general-purpose, low-latency interconnects provide a communication capability not available with any other COTS computing technology. Parallelization of one part of the MD computation, the 3D FFT, has been studied previously; for likely FPGA cluster sizes, however, the range-limited computation (RL) is more challenging. The motivation here is that direct replication of the single-chip design suffers from inefficient inter-board bandwidth usage. In particular, although communication in RL is local, bandwidth limitations will likely constrain performance unless great care is taken in design and analysis. In the multi-chip scenario, inter-board bandwidth is the critical constraint and the main target of this work. We analyze it with respect to three application restructurings: workload distribution, data forwarding pattern, and data locality. We describe how bandwidth can be balanced by configuring workload distribution and data forwarding paths with respect to the number of onboard transceiver ports. We also show that, by manipulating data locality, the multi-chip design can be efficiently migrated from the single-chip design, and that the total bandwidth required can be configured to satisfy the bandwidth limit.
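Why range-limited communication is local yet still bandwidth-hungry can be seen from a back-of-the-envelope halo calculation (a simplification for intuition, not the paper's model). With a cutoff radius c, each board owning a cubic region of side s must import all particle data within c of its faces:

```python
def halo_fraction(box_side, cutoff):
    """Fraction of a board's region volume that must be received from
    neighbors for a range-limited (cutoff) force computation: the shell
    of thickness `cutoff` around the owned cube. Requires box_side > 2*cutoff."""
    interior = (box_side - 2 * cutoff) ** 3
    shell = box_side ** 3 - interior
    return shell / box_side ** 3

# As regions shrink (more boards), the surface shell dominates the volume.
for side in (20, 10, 5):
    print(side, round(halo_fraction(side, cutoff=1), 3))
```

The fraction grows as regions shrink, which is why scaling out RL across boards stresses inter-board bandwidth even though each exchange is only with nearest neighbors.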
Title: CQNN: a CGRA-based QNN Framework
Authors: Tong Geng, Chunshu Wu, Cheng Tan, B. Fang, Ang Li, Martin C. Herbordt
DOI: https://doi.org/10.1109/HPEC43674.2020.9286194
Venue: 2020 IEEE High Performance Extreme Computing Conference (HPEC), published 2020-09-22
Abstract: Quantized Neural Networks (QNNs) have drawn tremendous attention since, compared with Convolutional Neural Networks (CNNs), they often dramatically reduce computation, communication, and storage demands with negligible loss in accuracy. To find an optimal balance between performance and accuracy, developers use different data-widths for different layers and channels. Given this large parameter space, it is challenging to design a QNN accelerator that is generally efficient across varied and flexible model configurations. In this paper we propose CQNN, a novel Coarse-Grained Reconfigurable Architecture (CGRA)-based QNN acceleration framework. CQNN has a large number of basic components for binary functions. By programming CQNN at runtime according to the target QNN model, these basic components are integrated to support QNN functions with any data-width and hyperparameter requirements. The result is an optimal QNN accelerator for the target model. The framework includes a compiler, hardware design, simulator, and RTL generator. Experimental results show CQNN can complete inference of AlexNet and VGG-16 within 0.13 ms and 2.63 ms, respectively. We demonstrate the design on an FPGA platform; however, this is only to showcase the method: the approach does not rely on any FPGA-specific features and can thus be implemented as an ASIC as well.
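The per-layer/per-channel data-width knob the abstract describes can be made concrete with a uniform symmetric quantizer, the simplest scheme of the kind QNN accelerators must support (an illustrative sketch, not CQNN's quantizer):

```python
import numpy as np

def quantize(w, bits):
    """Uniform symmetric quantization to the given bit-width: map weights
    onto 2^(bits-1)-1 signed integer levels, then dequantize for comparison."""
    levels = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / levels
    codes = np.round(w / scale).astype(np.int32)  # integer codes the hardware stores
    return codes * scale                          # dequantized approximation

w = np.array([-0.9, -0.2, 0.05, 0.4, 0.9])
for bits in (8, 4, 2):
    err = np.abs(quantize(w, bits) - w).max()
    print(bits, err)  # reconstruction error grows as bit-width shrinks
```

The accuracy/performance trade-off in the abstract is exactly this curve: fewer bits mean cheaper multipliers and less storage, at the cost of larger reconstruction error on the layers that can least tolerate it.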
Title: Efficient Sparse Matrix-Vector Multiplication on Intel PIUMA Architecture
Authors: S. Aananthakrishnan, R. Pawlowski, J. Fryman, I. Hur
DOI: https://doi.org/10.1109/HPEC43674.2020.9286245
Venue: 2020 IEEE High Performance Extreme Computing Conference (HPEC), published 2020-09-22
Abstract: Intel PIUMA is a novel architecture tailored for graph analytics. SpMV is a core component of graph analytics, and we report on an early performance study of SpMV on the Intel PIUMA architecture.
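For readers unfamiliar with the kernel being studied, SpMV over the standard CSR (compressed sparse row) layout looks like the following (a generic reference sketch, unrelated to the PIUMA implementation):

```python
def spmv_csr(indptr, indices, data, x):
    """y = A @ x for a matrix in CSR form: for each row, accumulate
    products over only that row's stored nonzeros. The gathers through
    `indices` are the irregular memory accesses graph architectures target."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y

# A = [[2, 0, 1],
#      [0, 3, 0]]
indptr, indices, data = [0, 2, 3], [0, 2, 1], [2.0, 1.0, 3.0]
print(spmv_csr(indptr, indices, data, [1.0, 1.0, 1.0]))  # [3.0, 3.0]
```

The indirect loads `x[indices[k]]` have little locality on graph-shaped matrices, which is why SpMV is a natural stress test for an architecture built for graph analytics.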
Title: iBench: a Distributed Inference Simulation and Benchmark Suite
Authors: W. Brewer, G. Behm, A. Scheinine, Ben Parsons, Wesley Emeneker, Robert P. Trevino
DOI: https://doi.org/10.1109/HPEC43674.2020.9286169
Venue: 2020 IEEE High Performance Extreme Computing Conference (HPEC), published 2020-09-22
Abstract: We present a novel distributed inference benchmarking system, called "iBench", that provides relevant performance metrics for high-performance edge computing systems using trained deep learning models. The proposed benchmark is unique in that it includes data-transfer performance through a distributed system, such as a supercomputer, using clients and servers to provide a system-level benchmark. iBench is flexible and robust enough to allow benchmarking of custom-built inference servers, which we demonstrate by developing a custom Flask-based inference server to serve MLPerf's official ResNet50v1.5 model. In this paper, we compare iBench against MLPerf inference performance on an 8-V100 GPU node. iBench is shown to provide two primary advantages over MLPerf: (1) the ability to measure distributed inference performance, and (2) a more realistic measure of performance for inference servers on HPC systems, by accounting for factors beyond inference time, such as HTTP request-response time, payload pre-processing and packing time, and ingest time.
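The distinction iBench draws between end-to-end and inference-only time can be sketched with a toy timing harness. The stage functions here are hypothetical stand-ins, not iBench's API; only the measurement pattern is the point:

```python
import time

def timed(fn, *args):
    """Run fn(*args), returning (result, elapsed seconds)."""
    t0 = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - t0

# Hypothetical stand-ins for the stages a system-level benchmark separates:
# payload preparation, network transport, and the model forward pass itself.
def preprocess(x): return [float(v) for v in x]
def transport(x): time.sleep(0.005); return x  # simulated network hop
def infer(x): return sum(x)

payload = list(range(1000))
_, t_pre = timed(preprocess, payload)
_, t_net = timed(transport, payload)
_, t_inf = timed(infer, payload)
print(f"end-to-end {t_pre + t_net + t_inf:.4f}s vs inference-only {t_inf:.4f}s")
```

A device-level benchmark reports only the last number; a system-level one reports the sum, which is what the client actually experiences.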
Title: Architectural Analysis of Deep Learning on Edge Accelerators
Authors: Luke Kljucaric, A. Johnson, A. George
DOI: https://doi.org/10.1109/HPEC43674.2020.9286209
Venue: 2020 IEEE High Performance Extreme Computing Conference (HPEC), published 2020-09-22
Abstract: As computer architectures continue to integrate application-specific hardware, it is critical to understand the relative performance of devices for maximum application acceleration. The goal of benchmarking suites such as MLPerf, which analyzes machine-learning (ML) hardware performance, is to standardize fair comparisons of different hardware architectures. However, many applications are not well represented by these standards and require different workloads, such as other ML models and datasets, to achieve similar goals. Additionally, many devices feature hardware optimized for data types other than 32-bit floating point, the standard representation defined by MLPerf. Edge-computing devices often feature application-specific hardware to offload common ML operations from the constrained CPU. This research analyzes multiple low-power compute architectures featuring ML-specific hardware on a case study of handwritten Chinese character recognition. Specifically, AlexNet and a custom version of GoogLeNet are benchmarked in terms of streaming latency for optical character recognition. Because these models are custom and not the most widely used, many architectures are not specifically optimized for them; their performance can therefore stress devices in different yet insightful ways, from which generalizations about the performance of other models can be drawn. The NVIDIA Jetson AGX Xavier (AGX), Intel Neural Compute Stick 2 (NCS2), and Google Edge TPU architectures are analyzed with respect to their performance. The AGX and TPU devices showed the lowest streaming latency for AlexNet and GoogLeNet, respectively. Additionally, the tightly integrated NCS2 design showed the best generalizability in performance and efficiency across neural networks.
Title: Evaluating Cryptographic Performance of Raspberry Pi Clusters
Authors: Daniel Hawthorne-Madell, Michael P. Kapralos, R. Blaine, Suzanne J. Matthews
DOI: https://doi.org/10.1109/HPEC43674.2020.9286247
Venue: 2020 IEEE High Performance Extreme Computing Conference (HPEC), published 2020-09-22
Abstract: ARM-based single-board computers (SBCs) such as the Raspberry Pi capture the imaginations of hobbyists and scientists due to their low cost and versatility. With the deluge of data produced in edge environments, SBCs and SBC clusters have emerged as low-cost platforms for data collection and analysis. Simultaneously, security is a growing concern as new regulations require secure communication for data collected at the edge. In this paper, we compare the performance of a Raspberry Pi cluster to a power-efficient next unit of computing (NUC) and a midrange desktop (MRD) on three leading cryptographic algorithms (AES, Twofish, and Serpent), and assess the general-purpose performance of the three systems using the HPL benchmark. Our results suggest that hardware-level instruction sets for all three cryptographic algorithms should be implemented on single-board computers to aid secure data transfer at the edge.
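A throughput comparison like the one in this paper boils down to a simple harness: encrypt a fixed buffer repeatedly and divide bytes by elapsed time. The sketch below uses a toy XOR cipher purely as a stand-in (Python's standard library has no AES; the paper benchmarks real AES, Twofish, and Serpent implementations), so only the measurement pattern is meaningful:

```python
import time

def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Toy stand-in cipher for harness demonstration; not secure."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def throughput_mb_s(fn, data, key, reps=5):
    """Measure encryption throughput in MB/s over `reps` passes."""
    t0 = time.perf_counter()
    for _ in range(reps):
        fn(data, key)
    dt = time.perf_counter() - t0
    return len(data) * reps / dt / 1e6

data = bytes(range(256)) * 1024      # 256 KiB test buffer
key = b"sixteen-byte-key"            # 128-bit key, like AES-128
print(f"{throughput_mb_s(xor_cipher, data, key):.1f} MB/s")
```

Running the same harness with each real cipher on the Pi, NUC, and MRD yields the cross-system comparison; the gap between software ciphers and hardware instruction sets (e.g. ARMv8 AES extensions) is what motivates the paper's recommendation.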