Title: Target Classification in Synthetic Aperture Radar and Optical Imagery Using Loihi Neuromorphic Hardware
Authors: Mark D. Barnell, Courtney Raymond, Matthew Wilson, Darrek Isereau, Chris Cicotta
In: 2020 IEEE High Performance Extreme Computing Conference (HPEC). DOI: https://doi.org/10.1109/HPEC43674.2020.9286246
Abstract: Intel's novel Loihi processing chip has been used to explore new information exploitation techniques. Specifically, we analyzed two types of data (optical and radar). These data modalities and associated machine learning algorithms were used to showcase the ability of the system to address real-world problems, such as object detection and classification. Intel's fully digital Loihi design is inspired by biological processes and brain functions. Neuromorphic architectures such as Loihi promise to improve computational efficiency for various machine learning tasks, with a realizable path toward implementation in many systems, e.g., airborne computing for intelligence, surveillance, and reconnaissance systems, and/or future autonomous vehicles and household appliances. With the current software development kit, it is possible to train an artificial neural network model in a common deep learning framework such as Keras and quantize the model weights for a simple, direct translation onto the Loihi hardware. The radar imagery analyzed included a seven-class vehicle target set, which was processed at a rate of 9.5 images per second with an overall accuracy of 90.1%. The optical data included a binary (two-class) data set and a nine-class data set. The binary classifier processed the optical data at a rate of 12.8 images per second with 94.0% accuracy; the nine-class optical data was processed at a rate of 12.9 images per second with 79.7% accuracy. Lastly, the system used ~6 W of total power, with ~0.6 W utilized by the neuromorphic cores. The inferencing energy used to classify each image varied between 14.9 and 63.2 millijoules.
Title: Triangle Counting with Cyclic Distributions
Authors: A. Lumsdaine, Luke Dalessandro, Kevin Deweese, J. Firoz, Scott McMillan
In: 2020 IEEE High Performance Extreme Computing Conference (HPEC). DOI: https://doi.org/10.1109/HPEC43674.2020.9286220
Abstract: Triangles are the simplest non-trivial subgraphs, and triangle counting is used in a number of different applications. The order in which vertices are processed in triangle counting strongly affects the amount of work that needs to be done (and thus the overall performance). Ordering vertices by degree has been shown to be a particularly effective approach. However, for graphs with skewed degree distributions (such as power-law graphs), ordering by degree also skews the distribution of work; parallelization must account for this distribution in order to balance work among workers. In this paper we provide an in-depth analysis of the ramifications of degree-based ordering on parallel triangle counting. We present an approach for partitioning work in triangle counting, based on cyclic distribution and some surprisingly simple C++ implementations. Experimental results demonstrate the effectiveness of our approach, particularly for power-law (and social network) graphs.
{"title":"Exploiting GPU Direct Access to Non-Volatile Memory to Accelerate Big Data Processing","authors":"Mahsa Bayati, M. Leeser, N. Mi","doi":"10.1109/HPEC43674.2020.9286174","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286174","url":null,"abstract":"The amount of data being collected for analysis is growing at an exponential rate. Along with this growth comes increasing necessity for computation and storage. Researchers are addressing these needs by building heterogeneous clusters with CPUs and computational accelerators such as GPUs equipped with high I/O bandwidth storage devices. One of the main bottlenecks of such heterogeneous systems is the data transfer bandwidth to GPUs when running I/O intensive applications. The traditional approach gets data from storage to the host memory and then transfers it to the GPU, which can limit data throughput and processing and thus degrade the end-to-end performance. In this paper, we propose a new framework to address the above issue by exploiting Peer-to-Peer Direct Memory Access to allow GPU direct access of the storage device and thus enhance the performance for parallel data processing applications in a heterogeneous big-data platform. Our heterogeneous cluster is supplied with CPUs and GPUs as computing resources and Non-Volatile Memory express (NVMe) drives as storage resources. We deploy an Apache Spark platform to execute representative data processing workloads over this heterogeneous cluster and then adopt Peer-to-Peer Direct Memory Access to connect GPUs to non-volatile storage directly to optimize the GPU data access. Experimental results reveal that this heterogeneous Spark platform successfully bypasses the host memory and enables GPUs to communicate directly to the NVMe drive, thus achieving higher data transfer throughput and improving both data communication time and end-to-end nerformance by 20%.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127440208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhanced Parallel Simulation for ACAS X Development","authors":"A. Gjersvik","doi":"10.1109/HPEC43674.2020.9286197","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286197","url":null,"abstract":"ACAS X is the next generation airborne collision avoidance system intended to meet the demands of the rapidly evolving U.S. National Airspace System (NAS). The collision avoidance safety and operational suitability of the system are optimized and continuously evaluated by simulating billions of characteristic aircraft encounters in a fast-time Monte Carlo environment. There is therefore an inherent computational cost associated with each ACAS X design iteration and parallelization of the simulations is necessary to keep up with rapid design cycles. This work describes an effort to profile and enhance the parallel computing infrastructure deployed on the computing resources offered by the Lincoln Laboratory Supercomputing Center. The approach to large-scale parallelization of our fast-time airspace encounter simulation tool is presented along with corresponding parallel profile data collected on different kinds of compute nodes. A simple stochastic model for distributed simulation is also presented to inform optimal work batching for improved simulation efficiency. The paper concludes with a discussion on how this high-performance parallel simulation method enables the rapid safety-critical design of ACAS X in a fast-paced iterative design process.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126020577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Parameter Sensitivity Analysis of the SparTen High Performance Sparse Tensor Decomposition Software
Authors: J. Myers, Daniel M. Dunlavy, K. Teranishi, D. Hollman
In: 2020 IEEE High Performance Extreme Computing Conference (HPEC). DOI: https://doi.org/10.1109/HPEC43674.2020.9286210
Abstract: Tensor decomposition models play an increasingly important role in modern data science applications. One problem of particular interest is fitting a low-rank Canonical Polyadic (CP) tensor decomposition model when the tensor has sparse structure and the tensor elements are nonnegative count data. SparTen is a high-performance C++ library which computes a low-rank decomposition using different solvers: a first-order quasi-Newton method or a second-order damped Newton method, along with the appropriate choice of runtime parameters. Since the default parameters in SparTen were tuned to experimental results from prior published work, conducted on a single real-world dataset using MATLAB implementations of these methods, it remains unclear whether the parameter defaults in SparTen are appropriate for general tensor data. Furthermore, it is unknown how sensitive algorithm convergence is to changes in the input parameter values. This report addresses these unresolved issues with large-scale experimentation on three benchmark tensor data sets. Experiments were conducted on several different CPU architectures and replicated with many initial states to establish generalized profiles of algorithm convergence behavior.
Title: Beyond Floating-Point Ops: CNN Performance Prediction with Critical Datapath Length
Authors: David Langerman, A. Johnson, Kyle Buettner, A. George
In: 2020 IEEE High Performance Extreme Computing Conference (HPEC). DOI: https://doi.org/10.1109/HPEC43674.2020.9286182
Abstract: We propose Critical Datapath Length (CDL), a powerful, interpretable metric of neural-network models that enables accurate execution time prediction on parallel device architectures. CDL addresses the fact that the total number of floating-point operations (FLOPs) in a model is an inconsistent predictor of real execution time due to the highly parallel nature of tensor operations and hardware accelerators. Our results show that, on GPUs, CDL correlates to execution time significantly better than FLOPs, making it a useful performance predictor.
{"title":"Large-scale Sparse Tensor Decomposition Using a Damped Gauss-Newton Method","authors":"Teresa M. Ranadive, M. Baskaran","doi":"10.1109/HPEC43674.2020.9286202","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286202","url":null,"abstract":"CANDECOMP/PARAFAC (CP) tensor decomposition is a popular unsupervised machine learning method with numerous applications. This process involves modeling a high-dimensional, multi-modal array (a tensor) as the sum of several low-dimensional components. In order to decompose a tensor, one must solve an optimization problem, whose objective is often given by the sum of the squares of the tensor and decomposition model entry differences. One algorithm occasionally utilized to solve such problems is CP-OPT-DGN, a damped Gauss-Newton all-at-once optimization method for CP tensor decomposition. However, there are currently no published results that consider the decomposition of large-scale (with up to billions of non-zeros), sparse tensors using this algorithm. This work considers the decomposition of large-scale tensors using an efficiently implemented CP-OPT-DGN method. It is observed that CP-OPT-DGN significantly outperforms CP-ALS (CP-Alternating Least Squares) and CP-OPT-QNR (a quasi-Newton-Raphson all-at-once optimization method for CP tensor decomposition), two other widely used tensor decomposition algorithms, in terms of accuracy and latent behavior detection.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126700663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using RAPIDS AI to Accelerate Graph Data Science Workflows","authors":"Todd Hricik, David A. Bader, Oded Green","doi":"10.1109/HPEC43674.2020.9286224","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286224","url":null,"abstract":"Scale free networks are abundant in many natural, social, and engineering phenomena for which there exists a substantial corpus of theory able to elucidate many of their underlying properties. In this paper we study the scalability of some widely available Python-based tools for the empirical investigation of scale free network data in a typical early stage analysis pipeline. We demonstrate how porting serial implementations of commonly used pipeline data structures and methods to parallel hardware via the NVIDIA RAPIDS AI API requires minimal rewriting of code. As a utility for each pipeline we recorded the time required to complete the analysis for both the serial and parallelized workflows on a task-wise basis. Furthermore, we review a statistically based methodology for fitting a power-law to empirical data. Maximum likelihood estimations for scale were inferred after using Kolmogorov-Smirnov based methods to determine location estimates. Our serial implementation of a typical early stage network analysis workflow uses a combination of widely used data structures and algorithms provided by the NumPy, Pandas and NetworkX frameworks. We then parallelized our workflow using the APIs provided by NVIDIA's RAPIDS AI open data science libraries and measured the relative time to completion for the tasks of ingesting raw data, creating a graph representation of the data and finally fitting a power-law distribution to the empirical observations. The results of our experiments, run on graphs ranging in size from 1 million to 20 million edges, demonstrate that significantly less time is required to complete the tasks of generating a graph from an edge list, computing the degree of all nodes in the graph and fitting the scale and location parameters to the observed data.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114090785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: MetaCL: Automated "Meta" OpenCL Code Generation for High-Level Synthesis on FPGA
Authors: P. Sathre, Atharva Gondhalekar, Mohamed W. Hassan, W. Feng
In: 2020 IEEE High Performance Extreme Computing Conference (HPEC). DOI: https://doi.org/10.1109/HPEC43674.2020.9286198
Abstract: Traditionally, FPGA programming has been done via a hardware description language (HDL). An HDL provides fine-grained control over reconfigurable hardware but with limited productivity due to a steep learning curve and tedious design cycle. Thus, high-level synthesis (HLS) approaches have been a significant boon to productivity, and in recent years, OpenCL has emerged as a vendor-agnostic HLS language that offers the added benefit of interoperation with other OpenCL platforms (e.g., CPU, GPU, DSP) and existing OpenCL software. However, OpenCL's productivity can also suffer from tedious boilerplate code and the need to manually coordinate the host (i.e., CPU) and device (i.e., FPGA or other device). So, we present MetaCL, a compiler-assisted interface that takes OpenCL kernel functions as input and automatically generates OpenCL host-side code as output. MetaCL produces more efficient and readable host-side code, ensures portability, and introduces minimal additional runtime overhead compared to unassisted OpenCL development.
{"title":"Accelerator Design and Performance Modeling for Homomorphic Encrypted CNN Inference","authors":"Tian Ye, R. Kannan, V. Prasanna","doi":"10.1109/HPEC43674.2020.9286219","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286219","url":null,"abstract":"The rapid advent of cloud computing has brought with it concerns on data security and privacy. Fully Homomorphic Encryption (FHE) is a technique for enabling data security that allows arbitrary computations to be performed directly on encrypted data. In particular, FHE can be used with convolutional neural networks (CNN) to perform inference as a service on homomorphic encrypted input data. However, the high computational demands of FHE inference require a careful understanding of the tradeoffs between various parameters such as security level, hardware resources and performance. In this paper, we propose a parameterized accelerator for homomorphic encrypted CNN inference. We first develop parallel algorithms to implement CNN operations via FHE primitives. We then develop a parameterized model to evaluate the performance of our CNN design. The model accepts inputs in terms of available hardware resources and security parameters and outputs performance estimates. As an illustration, for a typical image classification task on CIFAR-10 dataset with a seven-layer CNN model, we show that a batch of 4K encrypted images can be classified within 1 second on a device operating at 2 GHz clock rate with 16K MACs, 64 MB on-chip memory and 256 GB/s external memory bandwidth.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"130 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121546914","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}