R. Pearce, Trevor Steil, Benjamin W. Priest, G. Sanders
{"title":"One Quadrillion Triangles Queried on One Million Processors","authors":"R. Pearce, Trevor Steil, Benjamin W. Priest, G. Sanders","doi":"10.1109/HPEC.2019.8916243","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916243","url":null,"abstract":"We update our prior 2017 Graph Challenge submission [7] on large scale triangle counting in distributed memory by demonstrating scaling and validation on trillion-edge scale-free graphs. We incorporate recent distributed communication optimizations developed for irregular communication workloads [1], and demonstrate scaling up to 1.5 million cores of IBM BG/Q Sequoia at LLNL. We validate our implementation using nonstochastic Kronecker graph generation where ground-truth local and global triangle counts are known, and model our Kronecker graph inputs after the Graph500 [5] R-MAT inputs. To our knowledge, our results are the largest triangle count experiments on synthetic scale-free graphs to date.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"130 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128714521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"C to D-Wave: A High-level C Compilation Framework for Quantum Annealers","authors":"Mohamed W. Hassan, S. Pakin, Wu-chun Feng","doi":"10.1109/HPEC.2019.8916231","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916231","url":null,"abstract":"A quantum annealer solves optimization problems by exploiting quantum effects. Problems are represented as Hamiltonian functions that define an energy landscape. The quantum-annealing hardware relaxes to a solution corresponding to the ground state of the energy landscape. Expressing arbitrary programming problems in terms of real-valued Hamiltonian-function coefficients is unintuitive and challenging. This paper addresses the difficulty of programming quantum annealers by presenting a compilation framework that compiles a subset of C code to a quantum machine instruction (QMI) to be executed on a quantum annealer. Our work is based on a modular software stack that facilitates programming D-Wave quantum annealers by successively lowering code from C to Verilog to a symbolic “quantum macro assembly language” and finally to a device-specific Hamiltonian function. We demonstrate the capabilities of our software stack on a set of problems written in C and executed on a D-Wave 2000Q quantum annealer.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"277 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123432144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
C. Bobda, Taylor J. L. Whitaker, Joel Mandebi Mbongue, S. Saha
{"title":"Synthesis of Hardware Sandboxes for Trojan Mitigation in Systems on Chip","authors":"C. Bobda, Taylor J. L. Whitaker, Joel Mandebi Mbongue, S. Saha","doi":"10.1109/HPEC.2019.8916526","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916526","url":null,"abstract":"In this work, we propose a high-level synthesis approach for hardware sandboxes in system-on-chip. Using interface formalism to capture interactions between non-trusted IPs and trusted parts of a system on chip, along with the properties specification language to specify non-authorized actions of non-trusted IPs, sandboxes are generated and made ready for inclusion as IP in a system-on-chip design. The concepts of composition, compatibility, and refinement are used to capture illegal actions and optimize resources across the boundary of single IPs. We have designed a tool that automatically generates the sandbox and facilitates their integration into system-on-chip. Our approach was validated with benchmarks from trust-hub.com and FPGA implementations. All our results showed 100% Trojan detection and mitigation, with only a minimal increase in resource overhead and no performance decrease.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122833034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Almasri, Omer Anjum, Carl Pearson, Zaid Qureshi, Vikram Sharma Mailthody, R. Nagi, Jinjun Xiong, Wen-mei W. Hwu
{"title":"Update on k-truss Decomposition on GPU","authors":"M. Almasri, Omer Anjum, Carl Pearson, Zaid Qureshi, Vikram Sharma Mailthody, R. Nagi, Jinjun Xiong, Wen-mei W. Hwu","doi":"10.1109/HPEC.2019.8916285","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916285","url":null,"abstract":"In this paper, we present an update to our previous submission on k-truss decomposition from Graph Challenge 2018. For the single-k k-truss implementation, we propose multiple algorithmic optimizations that significantly improve performance by up to 35.2x (6.9x on average) compared to our previous GPU implementation. In addition, we present a scalable multi-GPU implementation in which each GPU handles a different ‘k’ value. Compared to our prior multi-GPU implementation, the proposed approach is faster by up to 151.3x (78.8x on average). When only the edges of the maximal k-truss are sought, incrementing the ‘k’ value in each iteration is inefficient, particularly for graphs with a large maximum k-truss. Thus, we propose a binary search over the ‘k’ value to find the maximal k-truss. The binary search approach on a single GPU is up to 101.5x (24.3x on average) faster than our 2018 k-truss submission. Lastly, we show that the proposed binary search finds the maximum k-truss for the “Twitter” graph dataset, which has 2.8 billion bidirectional edges, in just 16 minutes on a single V100 GPU.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131927312","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sayan Ghosh, M. Halappanavar, Antonino Tumeo, A. Kalyanaraman
{"title":"Scaling and Quality of Modularity Optimization Methods for Graph Clustering","authors":"Sayan Ghosh, M. Halappanavar, Antonino Tumeo, A. Kalyanaraman","doi":"10.1109/HPEC.2019.8916299","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916299","url":null,"abstract":"Real-world graphs exhibit structures known as “communities” or “clusters” consisting of a group of vertices with relatively high connectivity between them, as compared to the rest of the vertices in the network. Graph clustering or community detection is a fundamental graph operation used to analyze real-world graphs occurring in the areas of computational biology, cybersecurity, electrical grids, etc. Similar to other graph algorithms, owing to irregular memory accesses and inherently sequential nature, current algorithms for community detection are challenging to parallelize. However, in order to analyze large networks, it is important to develop scalable parallel implementations of graph clustering that are capable of exploiting the architectural features of modern supercomputers. In response to the 2019 Streaming Graph Challenge, we present quality and performance analysis of our distributed-memory community detection using Vite, which is our distributed memory implementation of the popular Louvain method, on the ALCF Theta supercomputer. Clustering methods such as Louvain that rely on modularity maximization are known to suffer from the resolution limit problem, preventing identification of clusters of certain sizes. Hence, we also include quality analysis of our shared-memory implementation of the Fast-tracking Resistance method, in comparison with Louvain on the challenge datasets. Furthermore, we introduce an edge-balanced graph distribution for our distributed memory implementation, that significantly reduces communication, offering up to 80% improvement in the overall execution time. 
In addition to performance/quality analysis, we also include details on the power/energy consumption, and memory traffic of the distributed-memory clustering implementation using real-world graphs with over a billion edges.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"347 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124288977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Many-target, Many-sensor Ship Tracking and Classification","authors":"Leonard Kosta, John Irvine, Laura Seaman, H. Xi","doi":"10.1109/HPEC.2019.8916332","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916332","url":null,"abstract":"Government agencies such as DARPA wish to know the numbers, locations, tracks, and types of vessels moving through strategically important regions of the ocean. We implement a multiple hypothesis testing algorithm to simultaneously track dozens of ships with longitude and latitude data from many sensors, then use a combination of behavioral fingerprinting and deep learning techniques to classify each vessel by type. The number of targets is unknown a priori. We achieve both high track purity and high classification accuracy on several datasets.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"126 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122510712","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Louis Jenkins, J. Firoz, Marcin Zalewski, C. Joslyn, Mark Raugas
{"title":"Graph Algorithms in PGAS: Chapel and UPC++","authors":"Louis Jenkins, J. Firoz, Marcin Zalewski, C. Joslyn, Mark Raugas","doi":"10.1109/HPEC.2019.8916309","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916309","url":null,"abstract":"The Partitioned Global Address Space (PGAS) programming model can be implemented either with programming language features or with runtime library APIs, each implementation favoring different aspects (e.g., productivity, abstraction, flexibility, or performance). Certain language and runtime features, such as collectives, explicit and asynchronous communication primitives, and constructs facilitating overlap of communication and computation (such as futures and conjoined futures) can enable better performance and scaling for irregular applications, in particular for distributed graph analytics. We compare graph algorithms in one of each of these environments: the Chapel PGAS programming language and the UPC++ PGAS runtime library. We implement algorithms for breadth-first search and triangle counting graph kernels in both environments. We discuss the code in each of the environments, and compile performance data on a Cray Aries and an Infiniband platform. 
Our results show that the library-based approach of UPC++ currently provides strong performance while Chapel provides a high-level abstraction that, although harder to optimize, still provides comparable performance.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"297-301 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130817903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Survey on Hardware Security Techniques Targeting Low-Power SoC Designs","authors":"Alan Ehret, K. Gettings, B. R. Jordan, M. Kinsy","doi":"10.1109/HPEC.2019.8916486","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916486","url":null,"abstract":"In this work, we survey hardware-based security techniques applicable to low-power system-on-chip designs. Techniques related to a system’s processing elements, volatile main memory and caches, non-volatile memory and on-chip interconnects are examined. Threat models for each subsystem and technique are considered. Performance overheads and other trade-offs for each technique are discussed. Defenses with similar threat models are compared.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122896317","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast and Scalable Distributed Tensor Decompositions","authors":"M. Baskaran, Thomas Henretty, J. Ezick","doi":"10.1109/HPEC.2019.8916319","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916319","url":null,"abstract":"Tensor decomposition is a prominent technique for analyzing multi-attribute data and is being increasingly used for data analysis in different application areas. Tensor decomposition methods are computationally intense and often involve irregular memory accesses over large-scale sparse data. Hence it becomes critical to optimize the execution of such data intensive computations and associated data movement to reduce the eventual time-to-solution in data analysis applications. With the prevalence of using advanced high-performance computing (HPC) systems for data analysis applications, it is becoming increasingly important to provide fast and scalable implementations of tensor decompositions and execute them efficiently on modern and advanced HPC systems. In this paper, we present distributed tensor decomposition methods that achieve faster, memory-efficient, and communication-reduced execution on HPC systems. We demonstrate that our techniques reduce the overall communication and execution time of tensor decomposition methods when they are used for analyzing datasets of varied size from real applications. 
We illustrate our results on an HPE Superdome Flex server, a high-end modular system offering large-scale in-memory computing, and on a distributed cluster of Intel Xeon multi-core nodes.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128037102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}