Title: Low Power Computing and Simultaneous Electro-Optical/Radar Data Processing using IBM’s NS16e 16-chip Neuromorphic Hardware
Authors: Mark D. Barnell, Courtney Raymond, Daniel Brown, Matthew Wilson, Éric Côté
Venue: 2019 IEEE High Performance Extreme Computing Conference (HPEC), September 2019. DOI: 10.1109/HPEC.2019.8916311
Abstract: For the first time, advanced machine learning (ML) compute architectures, techniques, and methods were demonstrated simultaneously on United States Geological Survey (USGS) optical imagery and Department of Defense (DoD) Synthetic Aperture Radar (SAR) imagery, using IBM’s new NS16e neurosynaptic processor board, which comprises 16 TrueNorth chips. The Air Force Research Laboratory (AFRL) Information Directorate Advanced Computing and Communications Division continues to develop and demonstrate bio-inspired computing algorithms and architectures designed to provide ultra-low-power, ground and airborne High-Performance Computing (HPC) solutions for the operational and tactical real-time processing needs of Intelligence, Surveillance, and Reconnaissance (ISR) missions on small-form-factor hardware and in Size, Weight and Power (SWaP)-constrained environments. With an average throughput of 16,000 inferences per second, the system achieved a processing efficiency of 1,066 inferences per Watt. The NS16e's power utilization never exceeded 15 Watts for this application, and the TrueNorth processors themselves accounted for less than 5.5 Watts of that total.
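As a quick sanity check on the reported figures, the inferences-per-Watt efficiency follows directly from dividing the average throughput by the 15 Watt board-level power bound; the short Python snippet below reproduces that arithmetic using only the numbers quoted in the abstract.

```python
# Sanity check of the reported NS16e efficiency figures (values taken from the abstract).
throughput_inferences_per_s = 16_000   # average system throughput
board_power_watts = 15.0               # upper bound on NS16e power draw for this application

efficiency = throughput_inferences_per_s / board_power_watts
print(f"{efficiency:.0f} inferences per Watt")  # ~1067, consistent with the reported 1,066 inf/W
```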
Title: Combining Tensor Decompositions and Graph Analytics to Provide Cyber Situational Awareness at HPC Scale
Authors: J. Ezick, Ben Parsons, W. Glodek, Thomas Henretty, M. Baskaran, R. Lethin, J. Feo, Tai-Ching Tuan, Christopher J. Coley, Leslie Leonard, R. Agrawal
Venue: 2019 IEEE High Performance Extreme Computing Conference (HPEC), September 2019. DOI: 10.1109/HPEC.2019.8916559
Abstract: This paper describes MADHAT (Multidimensional Anomaly Detection fusing HPC, Analytics, and Tensors), an integrated workflow that demonstrates the applicability of HPC resources to the problem of maintaining cyber situational awareness. MADHAT combines two high-performance packages: ENSIGN for large-scale sparse tensor decompositions and HAGGLE for graph analytics. Tensor decompositions isolate coherent patterns of network behavior in ways that common clustering methods based on distance metrics cannot. Parallelized graph analysis then uses directed queries on a representation that combines the elements of identified patterns with other available information (such as additional log fields, domain knowledge, network topology, whitelists and blacklists, prior feedback, and published alerts) to confirm or reject a threat hypothesis, collect context, and raise alerts. MADHAT was developed using the collaborative HPC Architecture for Cyber Situational Awareness (HACSAW) research environment and evaluated on structured network sensor logs collected from Defense Research and Engineering Network (DREN) sites using HPC resources at the U.S. Army Engineer Research and Development Center DoD Supercomputing Resource Center (ERDC DSRC). To date, MADHAT has analyzed logs with over 650 million entries.
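The following Python sketch illustrates, with made-up field names and log records, how structured network-log entries can be mapped into the kind of sparse count tensor (source, destination, port, time bin) that a CP decomposition engine such as ENSIGN would factor into rank-1 behavioral patterns; it is an illustration of the general idea, not MADHAT's actual ingest code.

```python
# Hypothetical illustration: mapping structured network-log records into a sparse
# count tensor (source IP x destination IP x destination port x time bin).
from collections import Counter

logs = [
    {"src": "10.0.0.5", "dst": "10.0.1.9",  "dport": 443, "ts": 1_567_296_010},
    {"src": "10.0.0.5", "dst": "10.0.1.9",  "dport": 443, "ts": 1_567_296_025},
    {"src": "10.0.0.7", "dst": "10.0.2.14", "dport": 22,  "ts": 1_567_299_700},
]

BIN_SECONDS = 3600  # one-hour time bins

def index_map(values):
    """Assign a dense integer index to each distinct value (one map per tensor mode)."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

src_ix  = index_map(r["src"] for r in logs)
dst_ix  = index_map(r["dst"] for r in logs)
port_ix = index_map(r["dport"] for r in logs)
time_ix = index_map(r["ts"] // BIN_SECONDS for r in logs)

# Sparse COO representation: (i, j, k, t) -> number of matching log entries.
# A CP decomposition of this tensor would express it as a sum of rank-1 patterns.
counts = Counter(
    (src_ix[r["src"]], dst_ix[r["dst"]], port_ix[r["dport"]], time_ix[r["ts"] // BIN_SECONDS])
    for r in logs
)
print(counts)  # {(0, 0, 1, 0): 2, (1, 1, 0, 1): 1}
```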
Title: Prototype Container-Based Platform for Extreme Quantum Computing Algorithm Development
Authors: P. Dreher, Madhuvanti Ramasami
Venue: 2019 IEEE High Performance Extreme Computing Conference (HPEC), September 2019. DOI: 10.1109/HPEC.2019.8916430
Abstract: Recent advances in the development of the first generation of quantum computing devices have provided researchers with computational platforms to explore new ideas and reformulate conventional computational codes suitable for a quantum computer. Developers can now implement these reformulations on both quantum simulators and hardware platforms through a cloud computing software environment. For example, the IBM Q Experience provides direct access to IBM's quantum simulators and quantum computing hardware platforms. However, these access options may not provide an optimal environment for developers who need to download and modify the source code and libraries. This paper focuses on the construction of a Docker container environment with Qiskit source code and libraries running on a local cloud computing system that can directly access the IBM Q Experience. This prototype container-based system allows single users and small project groups to do rapid prototyping, testing, and implementation of extreme-capability algorithms with more agility and flexibility than the IBM Q Experience website provides. It also serves as an excellent teaching environment for labs and project assignments within graduate courses in cloud computing and quantum computing. The paper also discusses computer security challenges in expanding this prototype container system to larger groups of quantum computing researchers.
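As a rough illustration of the kind of workload such a container would host, the sketch below runs a small Bell-state circuit on a locally installed Aer simulator using the Qiskit API as it stood around the time of this paper (newer Qiskit releases have reorganized these imports); it is not taken from the authors' environment.

```python
# Minimal sketch of a Qiskit job run against the local Aer simulator inside a
# container (Qiskit API circa 2019).
from qiskit import QuantumCircuit, Aer, execute

qc = QuantumCircuit(2, 2)
qc.h(0)                      # put qubit 0 into superposition
qc.cx(0, 1)                  # entangle qubits 0 and 1 (Bell state)
qc.measure([0, 1], [0, 1])

backend = Aer.get_backend("qasm_simulator")
counts = execute(qc, backend, shots=1024).result().get_counts(qc)
print(counts)                # roughly even split between '00' and '11'
```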
Title: Update on Triangle Counting on GPU
Authors: Carl Pearson, M. Almasri, Omer Anjum, Vikram Sharma Mailthody, Zaid Qureshi, R. Nagi, Jinjun Xiong, Wen-mei W. Hwu
Venue: 2019 IEEE High Performance Extreme Computing Conference (HPEC), September 2019. DOI: 10.1109/HPEC.2019.8916547
Abstract: This work presents an update to the triangle-counting portion of the subgraph isomorphism static graph challenge, motivated by a desire to understand the impact of CUDA unified memory on the triangle-counting problem. First, CUDA unified memory is used to overlap reading large graph data from disk with building the graph data structures in GPU memory. Second, we use CUDA unified memory hints to solve the multi-GPU performance scaling challenges present in our last submission. Finally, we improve the single-GPU kernel performance from our past submission by introducing a dynamic, work-stealing GPU kernel with persistent threads, which makes performance adaptive for large graphs without requiring a graph analysis phase.
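For context, the core operation such GPU kernels parallelize is the per-edge intersection of adjacency lists; the plain-Python sketch below shows that counting scheme on a toy edge list (a generic illustration of the technique, not the authors' CUDA kernel).

```python
# Per-edge adjacency-intersection triangle counting, the scheme GPU kernels of this
# kind typically parallelize; shown here in plain Python over an undirected edge list.
from collections import defaultdict

def count_triangles(edges):
    # Orient each edge from the lower-id endpoint (real implementations usually orient
    # by degree) so every triangle is counted exactly once.
    neighbors = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue
        lo, hi = min(u, v), max(u, v)
        neighbors[lo].add(hi)
    # For every oriented edge (u, v), triangles through it are |N(u) ∩ N(v)|.
    return sum(len(neighbors[u] & neighbors[v]) for u in neighbors for v in neighbors[u])

edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
print(count_triangles(edges))  # 1 triangle: {0, 1, 2}
```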
{"title":"A GPU Implementation of the Sparse Deep Neural Network Graph Challenge","authors":"M. Bisson, M. Fatica","doi":"10.1109/HPEC.2019.8916223","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916223","url":null,"abstract":"This paper presents a CUDA implementation of the latest addition to the Graph Challenge, the inference computation on a collection of large sparse deep neural networks. A single Tesla V100 can compute the inference at 3.7 TeraEdges/s. Using the managed memory API available in CUDA allows for simple and efficient distribution of these computations across a multiGPU NVIDIA DGX-2 server.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124750054","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
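For reference, the Sparse DNN Graph Challenge inference is a chain of sparse matrix-matrix products, each followed by a biased, clipped ReLU; the Python/SciPy sketch below shows that layer update on illustrative random matrices (the sizes, densities, bias value, and helper name infer are placeholders, not the authors' CUDA implementation).

```python
# Minimal sketch of sparse-DNN inference in the style of the Sparse DNN Graph Challenge:
# feature matrix times sparse layer weights, plus a per-layer bias, followed by a ReLU
# clipped at an upper bound. Matrix sizes and values are illustrative only.
import numpy as np
from scipy import sparse

def infer(features, layers, bias=-0.3, clip=32.0):
    Y = features
    for W in layers:
        Y = Y @ W                              # sparse matrix-matrix multiply
        Y = Y.toarray() if sparse.issparse(Y) else Y
        Y = np.clip(Y + bias, 0.0, clip)       # biased ReLU with upper clip
        Y = sparse.csr_matrix(Y)               # keep activations sparse between layers
    return Y

features = sparse.random(4, 8, density=0.5, format="csr", random_state=0)
layers = [sparse.random(8, 8, density=0.3, format="csr", random_state=l) for l in range(3)]
print(infer(features, layers).nnz)             # number of nonzero activations at the output
```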
Title: Multi-spectral Reuse Distance: Divining Spatial Information from Temporal Data
Authors: A. Cabrera, R. Chamberlain, J. Beard
Venue: 2019 IEEE High Performance Extreme Computing Conference (HPEC), September 2019. DOI: 10.1109/HPEC.2019.8916398
Abstract: The problem of efficiently feeding processing elements and finding ways to reduce data movement is pervasive in computing. Efficient modeling of both temporal and spatial locality of memory references is invaluable in identifying superfluous data movement in a given application. To this end, we present a new way to infer both spatial and temporal locality using reuse distance analysis. This is accomplished by performing reuse distance analysis at different data block granularities: specifically, 64B, 4KiB, and 2MiB sizes. This process of simultaneously observing reuse distance with multiple granularities is called multi-spectral reuse distance. This approach allows for a qualitative analysis of spatial locality, through observing the shifting of mass in an application’s reuse signature at different granularities. Furthermore, the shift of mass is empirically measured by calculating the Earth Mover’s Distance between reuse signatures of an application. From the characterization, it is possible to determine how spatially dense the memory references of an application are based on the degree to which the mass has shifted (or not shifted) and how close (or far) the Earth Mover’s Distance is to zero as the data block granularity is increased. It is also possible to determine an appropriate page size from this information, and whether or not a given page is being fully utilized. From the applications profiled, it is observed that not all applications will benefit from having a larger page size. Additionally, larger data block granularities subsuming smaller ones suggest that larger pages will allow for more spatial locality exploitation, but examining the memory footprint will show whether those larger pages are fully utilized or not.
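A minimal Python sketch of the multi-granularity idea: the same (synthetic) address trace is analyzed after mapping addresses to 64 B, 4 KiB, and 2 MiB blocks, and the reuse-distance signature visibly shifts toward smaller distances as the blocks grow. This is a plain stack-distance illustration of the concept, not the profiling tooling used in the paper.

```python
# Reuse distance at multiple block granularities ("multi-spectral" reuse distance).
from collections import OrderedDict

def reuse_distances(addresses, block_bytes):
    """Reuse distance = number of distinct blocks touched since the last use of a block."""
    last_seen = OrderedDict()                    # LRU stack: most recently used block last
    distances = []
    for addr in addresses:
        block = addr // block_bytes
        if block in last_seen:
            depth = list(reversed(last_seen)).index(block)  # distinct blocks since last use
            distances.append(depth)
        else:
            distances.append(float("inf"))       # cold miss: infinite reuse distance
        last_seen[block] = None
        last_seen.move_to_end(block)
    return distances

trace = [0x1000, 0x1040, 0x2000, 0x1000, 0x400000, 0x1040]
for granularity in (64, 4096, 2 * 1024 * 1024):
    # Mass shifts toward smaller distances as the block granularity grows.
    print(granularity, reuse_distances(trace, granularity))
```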
Title: IP Cores for Graph Kernels on FPGAs
Authors: S. Kuppannagari, Rachit Rajat, R. Kannan, A. Dasu, V. Prasanna
Venue: 2019 IEEE High Performance Extreme Computing Conference (HPEC), September 2019. DOI: 10.1109/HPEC.2019.8916363
Abstract: Graphs are a powerful abstraction for representing networked data in many real-world applications. The need for performing large-scale graph analytics has led to widespread adoption of dedicated hardware accelerators such as FPGAs for this purpose. In this work, we develop IP cores for several key graph kernels. Our IP cores use the graph processing over partitions (GPOP) programming paradigm to perform computations over graph partitions. Partitioning the input graph into non-overlapping partitions improves on-chip data reuse. Additional optimizations to exploit intra- and inter-partition parallelism and to reduce external memory accesses are also discussed. We generate FPGA designs for general graph algorithms with various vertex attributes and update propagation functions, such as Sparse Matrix Vector Multiplication (SpMV), PageRank (PR), Single Source Shortest Path (SSSP), and Weakly Connected Component (WCC). We target a platform consisting of large external DDR4 memory to store the graph data and an Intel Stratix FPGA to accelerate the processing. Experimental results show that our accelerators sustain a high throughput of up to 2250, 2300, 3378, and 2178 Million Traversed Edges Per Second (MTEPS) for SpMV, PR, SSSP, and WCC, respectively. Compared with several highly optimized multi-core designs, our FPGA framework achieves up to 20.5× speedup for SpMV, 16.4× for PR, 3.5× for SSSP, and 35.1× for WCC; compared with two state-of-the-art FPGA frameworks, our designs demonstrate up to 5.3× speedup for SpMV, 1.64× for PR, and 1.8× for WCC. We develop a performance model for our GPOP paradigm and use it to predict the performance of our designs assuming the graph is stored in HBM2 instead of DRAM. We further discuss extensions to our optimizations to improve throughput.
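The sketch below illustrates the scatter/gather structure of the GPOP paradigm in plain Python: edge contributions are first binned by destination partition and then reduced one partition at a time, so each phase touches only a partition-sized working set. The function gpop_spmv and the toy graph are illustrative stand-ins, not the FPGA IP cores themselves.

```python
# Graph processing over partitions (GPOP), illustrated as a partition-wise SpMV-style
# propagation: scatter edge contributions by destination partition, then gather each
# partition's updates with good locality.
def gpop_spmv(num_vertices, edges, values, num_partitions):
    part_size = (num_vertices + num_partitions - 1) // num_partitions
    part_of = lambda v: v // part_size

    # Scatter phase: bucket each edge's contribution by destination partition.
    bins = [[] for _ in range(num_partitions)]
    for src, dst, weight in edges:
        bins[part_of(dst)].append((dst, weight * values[src]))

    # Gather phase: reduce one destination partition at a time (partition-sized working set).
    result = [0.0] * num_vertices
    for bucket in bins:
        for dst, contribution in bucket:
            result[dst] += contribution
    return result

edges = [(0, 1, 1.0), (1, 2, 0.5), (3, 2, 2.0), (2, 0, 1.0)]
print(gpop_spmv(4, edges, values=[1.0, 2.0, 3.0, 4.0], num_partitions=2))  # [3.0, 1.0, 9.0, 0.0]
```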
Title: BLAST: Blockchain-based Trust Management in Smart Cities and Connected Vehicles Setup
Authors: Farah I. Kandah, Brennan Huber, Amani Altarawneh, Sai Medury, A. Skjellum
Venue: 2019 IEEE High Performance Extreme Computing Conference (HPEC), September 2019. DOI: 10.1109/HPEC.2019.8916229
Abstract: Advances in communication technologies and the Internet of Things (IoT) are driving smart-city adoption, which aims to increase the operational efficiency of infrastructure and to improve the quality of services and citizen welfare, among other worthy goals. For instance, it is estimated that by 2020, 75% of cars shipped globally will be equipped with hardware to facilitate vehicle connectivity. The privacy, reliability, and integrity of communication must be ensured so that actions can be accurate and implemented promptly after receiving actionable information. Because vehicles are equipped with the ability to compute, communicate, and sense their environment, there is a concomitant critical need to create and maintain trust among network entities in the context of the network’s dynamism, an issue that requires trust between entities to be built and validated in the short time before they leave each other’s range. In this work, we present a multi-tier scheme consisting of an authentication- and trust-building/distribution framework designed with blockchain technology to ensure the safety and validity of the information exchanged in the system. Through simulation, we illustrate the tradeoff between blockchain mining time and the number of blocks being generated, as well as the effect of vehicle speed on the number of blocks being generated.
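As generic background (not BLAST's specific consensus or trust scheme), the Python sketch below shows why mining time trades off against block-generation rate: a block of trust records is only accepted once a nonce is found whose hash clears the difficulty target, so a harder target directly lowers the number of blocks produced per unit time. The record fields and difficulty value are hypothetical.

```python
# Generic proof-of-work illustration of the mining-time vs. block-count tradeoff.
import hashlib, json, time

def mine_block(records, prev_hash, difficulty_bits=16):
    target = 1 << (256 - difficulty_bits)      # harder target (more bits) -> longer mining time
    payload = json.dumps(records, sort_keys=True)
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{prev_hash}{payload}{nonce}".encode()).hexdigest()
        if int(digest, 16) < target:
            return {"prev": prev_hash, "records": records, "nonce": nonce, "hash": digest}
        nonce += 1

start = time.time()
block = mine_block([{"vehicle": "V42", "trust": 0.91}], prev_hash="0" * 64)
print(f"mined in {time.time() - start:.2f}s after {block['nonce']} attempts")
```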
Title: Optimizing the Visualization Pipeline of a 3-D Monitoring and Management System
Authors: Rebecca Wild, M. Hubbell, J. Kepner
Venue: 2019 IEEE High Performance Extreme Computing Conference (HPEC), September 2019. DOI: 10.1109/HPEC.2019.8916493
Abstract: Monitoring and managing High Performance Computing (HPC) systems and environments generates an ever-growing amount of data. Making sense of this data, and presenting it so that system administrators and management can proactively identify system failures or understand the state of the system, requires a visualization platform that is as efficient and scalable as the underlying database tools used to store and analyze the data. In this paper we show how we leverage Accumulo, D4M, and Unity to build a 3D visualization platform for monitoring and managing the Lincoln Laboratory supercomputing systems, and how we have had to retool our approach to scale with those systems.
Title: Artificial Neural Network and Accelerator Co-design using Evolutionary Algorithms
Authors: Philip Colangelo, Oren Segal, Alexander Speicher, M. Margala
Venue: 2019 IEEE High Performance Extreme Computing Conference (HPEC), September 2019. DOI: 10.1109/HPEC.2019.8916533
Abstract: Multilayer feed-forward Artificial Neural Networks (ANNs) are universal function approximators capable of modeling measurable functions to any desired degree of accuracy. In practice, designing efficient neural network architectures requires significant effort and expertise, and designing architectures that also fit optimally on hardware for the benefit of acceleration adds yet another degree of complexity. In this paper, we use Evolutionary Cell Aided Design (ECAD), a framework capable of searching the design spaces of ANN structures and reconfigurable hardware to find solutions based on a set of constraints and fitness functions. A modular and scalable 2D systolic-array-based machine learning accelerator design, built for an Arria 10 GX 1150 FPGA device using OpenCL, enables results to be tested and deployed on real hardware; alongside the hardware, a software model of the architecture was developed to speed up the evolutionary process. We present results from the ECAD framework showing the effect that various optimization objectives, including accuracy, images per second, effective giga-operations per second, and latency, have on both the ANN and hardware configurations. Through this work we show that a distinct best-performing solution can exist for each optimization objective. This work lays the foundation for finding machine-learning-based solutions for a wide range of applications having different system constraints.
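The toy Python sketch below mirrors the loop ECAD describes: each candidate pairs an ANN configuration with a systolic-array configuration, a fitness function scores the pair against a chosen objective, and selection plus mutation drive the search. The configuration fields and the surrogate fitness model are hypothetical stand-ins, not the framework's actual models.

```python
# Toy evolutionary co-design loop over joint ANN + accelerator configurations.
import random

def random_candidate():
    return {"layers": random.randint(2, 8),
            "width": random.choice([64, 128, 256, 512]),
            "pe_rows": random.choice([8, 16, 32]),
            "pe_cols": random.choice([8, 16, 32])}

def fitness(c):
    # Hypothetical surrogate: reward model capacity and array throughput, penalize latency.
    capacity = c["layers"] * c["width"]
    throughput = c["pe_rows"] * c["pe_cols"]
    latency = capacity / throughput
    return throughput + 0.1 * capacity - 5.0 * latency

def evolve(generations=20, population=16):
    pop = [random_candidate() for _ in range(population)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: population // 2]             # selection: keep the fitter half
        children = []
        for p in parents:
            child = dict(p)
            gene = random.choice(list(child))        # mutation: resample one field
            child[gene] = random_candidate()[gene]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

print(evolve())  # best joint ANN/accelerator configuration found by the toy search
```

A different fitness function (for example, one weighted toward latency) would generally drive the search to a different best configuration, which is the paper's point that each optimization objective can have its own optimal ANN/hardware pairing.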