{"title":"Embedded GPU Cluster Computing Framework for Inference of Convolutional Neural Networks","authors":"Evan T. Kain, Diego Wildenstein, A. Pineda","doi":"10.1109/HPEC.2019.8916253","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916253","url":null,"abstract":"The growing need for on-board image processing for space vehicles requires computing solutions that are both low-power and high-performance. Parallel computation using low-power embedded Graphics Processing Units (GPUs) satisfy both requirements. Our experiment involves the use of OpenMPI domain decomposition of an image processing algorithm based upon a pre-trained convolutional neural network (CNN) developed by the U.S. Air Force Research Laboratory (AFRL). Our testbed consists of six NVIDIA Jetson TX2 development boards operating in parallel. This parallel framework results in a speedup of $4.3 times $ on six processing nodes. This approach also leads to a linear decay in parallel efficiency as more processing nodes are added to the network. By replicating the data across processors in addition to distributing, we also characterize the best-case impact of adding triple modular redundancy (TMR) to our application.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"1 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123545639","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Parallelism of Breadth First Search (BFS) Algorithm for Accelerated Performance on GPUs","authors":"Hao Wen, W. Zhang","doi":"10.1109/HPEC.2019.8916551","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916551","url":null,"abstract":"Breadth-first search (BFS) is a basis for graph search and a core building block for many higher-level graph analysis applications. However, BFS is also a typical example of parallel computation that is inefficient on GPU architectures. In a graph, a small portion of nodes may have a large number of neighbors, which leads to irregular tasks on GPUs. In addition, the number of nodes in each layer of the graph is also irregular. Therefore, the number of active GPU threads is different for each layer of execution. These irregularities limit the parallelism of BFS executing on GPUs.Unlike the previous works focusing on fine-grained task management to address the irregularity, we propose Virtual-BFS (VBFS) to virtually change the graph itself. By adding virtual vertices, the high-degree nodes in the graph are divided into groups that have an equal number of neighbors, which increases the parallelism such that more GPU threads can work concurrently, and the data set also becomes more regular.Our experimental results show that the VBFS achieves significant speedup over the current GPU implementation of BFS from the Rodinia benchmark [4], and the energy efficiency is also improved.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal 
Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131954787","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MeXT: A Flow for Multiprocessor Exploration","authors":"C. Bobda, H. Ishebabi, Philipp Mahr, Joel Mandebi Mbongue, S. Saha","doi":"10.1109/HPEC.2019.8916428","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916428","url":null,"abstract":"This paper presents an extended design approach for heterogeneous multiprocessor systems. The goal in this particular design exploration approach is to ease the implementation of an adaptive multiprocessor system by creating components such as processing nodes or memories from an application. A program is profiled and analysed to gather information about task precedence, communication cost or computational patterns for hardware accelerator generation. This information is then used to solve an optimization problem using Integer Linear Programming or Answer Set Programming with the goal of 1) creating suitable multiprocessor hardware architecture and 2) mapping of tasks onto the processors. A lightweight message-passing library for on-chip communication of parallel programs is provided. The resulting abstract architecture is further processed using the vendor tool-chain to generate the target platform’s configuration. Two real-world case studies are used to show the feasibility of our design-space exploration approach.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132474056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast BFS-Based Triangle Counting on GPUs","authors":"Leyuan Wang, John Douglas Owens","doi":"10.1109/HPEC.2019.8916434","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916434","url":null,"abstract":"In this paper, we propose a novel method to compute triangle counting on GPUs. Unlike previous formulations of graph matching, our approach is BFS-based by traversing the graph in an all-source-BFS manner and thus can be mapped onto GPUs in a massively parallel fashion. Our implementation uses the Gunrock programming model and we evaluate our implementation in runtime and memory consumption compared with previous state-of-the-art work. We sustain a peak traversed-edges-per-second (TEPS) rate of nearly 10 GTEPS. Our algorithm is the most scalable and parallel among all existing GPU implementations and also outperforms all existing CPU distributed implementations. This work specifically focuses on leveraging our implementation on the triangle counting problem for the Subgraph Isomorphism Graph Challenge 2019, demonstrating a geometric mean speedup over the 2018 champion of $3.84 times $.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125715982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient implementation of sparse matrix-sparse vector multiplication for large scale graph analytics","authors":"M. Serrano","doi":"10.1109/HPEC.2019.8916413","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916413","url":null,"abstract":"We developed a parallel algorithm to improve the cache behavior and overall performance for multiplication of sparse matrices with sparse vectors (SpMSpV), an operation used increasingly in large graph analytics, particularly dynamic graphs in social networks and homeland security applications. The proposed algorithm builds upon the two-phase approach of partitioning the multiplication into a scaling phase and an aggregation phase, to achieve more cache-friendly access patterns individually in each phase [6], [3]. However, to handle dynamic graphs and achieve better load balancing for parallel implementation, we use a combination of private and shared bins, with synchronized access to shared bins to exchange the product terms between the two phases. The new algorithm accumulates product terms in private bins for each thread. The algorithm then performs a bulk transfer between a private bin and a shared bin, when the private bin becomes full. Then results are aggregated from the shared bins. In addition, we employ heuristics to decide the best algorithm for SpMSpV based on the number of nonzeros involved in the operation. When the number of nonzeros is large, it may be better to perform the operation as SpMV (sparse matrix times dense vector) despite the added conversion cost. Also, if the number of nonzeros is low it is advantageous to use a simplified algorithm. 
We compared our algorithm with existing algorithms for SpMSpV, and our evaluation shows that execution time is reduced by several times when large graphs are considered.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128842052","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An FPGA Decision Tree Classifier to Supervise a Communication SoC","authors":"Abdelrahman Elkanishy, Derrick T. Rivera, Abdel-Hameed A. Badawy, P. Furth, Z. Saifullah, Christopher P. Michael","doi":"10.1109/HPEC.2019.8916459","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916459","url":null,"abstract":"Wireless communication protocols are used in all smart devices and systems. This work is part of a proposed supervisory circuit that classifies the operation of a communication SoC, in particular, a Bluetooth (BT) SoC, at a low sampling frequency by monitoring the RF output power and input supply current. In essence, the goal is to inexpensively fabricate an RF envelope detector, power supply current monitor, and classifier on a low-cost, low-frequency integrated circuit. When the supervisory circuit detects abnormal behavior, it can shut off power to the BT chip. We extract simple descriptive features from the input and output power signals. Then, we train a machine learning (ML) model to classify the different BT operation modes, such as advertising and transmit/receive modes. In this work, we implemented the ML classifier and feature extraction on an FPGA with 100% matching with the corresponding MATLAB code. In the experimental setup, which included a function generator and an on-board ADC, errors in the FPGA-sampled values degraded the match slightly to 99.26%. 
Finally, a low-power ASIC is synthesized from the Verilog code in $0.18-mu mathrm{m}$ CMOS, with an estimated area of 0.0152 mm2 and power of $9.43 mu mathrm{W}$.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129426763","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimal Resource Allocation for Parallel Reservoir Simulation","authors":"Suha N. Kayum, M. Rogowski","doi":"10.1109/HPEC.2019.8916420","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916420","url":null,"abstract":"Over the past few decades, the oil and gas (O&G) industry has become heavily dependent on parallel scientific computing. The turnaround time of such applications depends heavily on the amount of resources dedicated to the task. Increasing the number of compute processes for the same job tends to produce diminishing returns, and does not always guarantee an increase in performance of a justified impact. This point describes scalability limits, which this work aims to avoid surpassing. An algorithm is presented in which a reservoir simulation run automatically adjusts and finds the optimal resources, which leads to improved performance, and the efficient utilization of compute resources, resulting in significant cost savings.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115468105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sparse Deep Neural Network Graph Challenge","authors":"J. Kepner, Simon Alford, V. Gadepally, Michael Jones, Lauren Milechin, Ryan A. Robinett, S. Samsi","doi":"10.1109/HPEC.2019.8916336","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916336","url":null,"abstract":"The MIT/IEEE/Amazon GraphChallenge.org encourages community approaches to developing new solutions for analyzing graphs and sparse data. Sparse AI analytics present unique scalability difficulties. The proposed Sparse Deep Neural Network (DNN) Challenge draws upon prior challenges from machine learning, high performance computing, and visual analytics to create a challenge that is reflective of emerging sparse AI systems. The Sparse DNN Challenge is based on a mathematically well-defined DNN inference computation and can be implemented in any programming environment. Sparse DNN inference is amenable to both vertex-centric implementations and array-based implementations (e.g., using the GraphBLAS.org standard). The computations are simple enough that performance predictions can be made based on simple computing hardware models. The input data sets are derived from the MNIST handwritten letters. The surrounding I/O and verification provide the context for each sparse DNN inference that allows rigorous definition of both the input and the output. Furthermore, since the proposed sparse DNN challenge is scalable in both problem size and hardware, it can be used to measure and quantitatively compare a wide range of present day and future systems. Reference implementations have been implemented and their serial and parallel performance have been measured. 
Specifications, data, and software are publicly available at GraphChallenge.org.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"144 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115176451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Survey and Benchmarking of Machine Learning Accelerators","authors":"A. Reuther, P. Michaleas, Michael Jones, V. Gadepally, S. Samsi, J. Kepner","doi":"10.1109/HPEC.2019.8916327","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916327","url":null,"abstract":"Advances in multicore processors and accelerators have opened the flood gates to greater exploration and application of machine learning techniques to a variety of applications. These advances, along with breakdowns of several trends including Moore’s Law, have prompted an explosion of processors and accelerators that promise even greater computational and machine learning capabilities. These processors and accelerators are coming in many forms, from CPUs and GPUs to ASICs, FPGAs, and dataflow accelerators. This paper surveys the current state of these processors and accelerators that have been publicly announced with performance and power consumption numbers. The performance and power values are plotted on a scatter graph and a number of dimensions and observations from the trends on this plot are discussed and analyzed. For instance, there are interesting trends in the plot regarding power consumption, numerical precision, and inference versus training. We then select and benchmark two commercially available low size, weight, and power (SWaP) accelerators as these processors are the most interesting for embedded and mobile machine learning inference applications that are most applicable to the DoD and other SWaP constrained users. 
We determine how they actually perform with real-world images and neural network models, compare those results to the reported performance and power consumption values and evaluate them against an Intel CPU that is used in some embedded applications.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114156484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TapirXLA: Embedding Fork-Join Parallelism into the XLA Compiler in TensorFlow Using Tapir","authors":"T. Schardl, S. Samsi","doi":"10.1109/HPEC.2019.8916312","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916312","url":null,"abstract":"This work introduces TapirXLA, a replacement for TensorFlow’s XLA compiler that embeds recursive fork-join parallelism into XLA’s low-level representation of code. Machine-learning applications employ a variety of technologies to improve performance, including compiler technology. But compilers in machine-learning frameworks lack a deep understanding of parallelism, causing them to lose performance by missing optimizations on parallel computation. This work studies how Tapir, a compiler intermediate representation (IR) that embeds parallelism into a mainstream compiler IR, can be incorporated into a compiler for machine learning to remedy this problem. TapirXLA modifies the XLA compiler in TensorFlow to employ the Tapir/LLVM compiler to optimize low-level parallel computation. TapirXLA encodes the parallelism within high-level TensorFlow operations using Tapir’s representation of fork-join parallelism. Furthermore, TapirXLA exposes to the compiler implementations of linear-algebra library routines whose parallel operations are encoded using Tapir’s representation. We compared the performance of TensorFlow using TapirXLA against TensorFlow using an unmodified XLA compiler. 
On four neural-network benchmarks, TapirXLA speeds up the parallel running time of the network by a geometric-mean multiplicative factor of 30% to 100%, across four CPU architectures.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130118481","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}