{"title":"Leveraging Linear Algebra to Count and Enumerate Simple Subgraphs","authors":"Vitaliy Gleyzer, Andrew J. Soszynski, E. Kao","doi":"10.1109/HPEC43674.2020.9286191","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286191","url":null,"abstract":"Even though subgraph counting and subgraph matching are well-known NP-Hard problems, they are foundational building blocks for many scientific and commercial applications. In order to analyze graphs that contain millions to billions of edges, distributed systems can provide computational scalability through search parallelization. One recent approach for exposing graph algorithm parallelization is through a linear algebra formulation and the use of the matrix multiply operation, which conceptually is equivalent to a massively parallel graph traversal. This approach has several benefits, including 1) a mathematically-rigorous foundation, and 2) ability to leverage specialized linear algebra accelerators and high-performance libraries. In this paper we explore and define a linear algebra methodology for performing exact subgraph counting and matching for 4-vertex subgraphs excluding the clique. Matches on these simple subgraphs can be joined as components for a larger subgraph. 
With thorough analysis we demonstrate that the linear algebra formulation leverages path aggregation, which allows it to be up to 2x to 5x more efficient in traversing the search space and compressing the results compared to tree-based subgraph matching techniques.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"126 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128024059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GraphSDH: A General Graph Sampling Framework with Distribution and Hierarchy","authors":"Jingbo Hu, Guohao Dai, Yu Wang, Huazhong Yang","doi":"10.1109/HPEC43674.2020.9286173","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286173","url":null,"abstract":"Large-scale graphs play a vital role in various applications, but their use is limited by long processing times. Graph sampling is an effective way to reduce the amount of graph data and accelerate graph algorithms. However, previous work usually lacks theoretical analysis related to graph algorithm models. In this study, GraphSDH (Graph Sampling with Distribution and Hierarchy), a general large-scale graph sampling framework, is established based on the vertex-centric graph model. For four common sampling techniques, we derive the sampling probability that minimizes the variance, and optimize the design according to whether there is a pre-estimation process for the intermediate value. To further improve the accuracy of the graph algorithm, we propose a stratified sampling method based on vertex degree and a hierarchical optimization scheme based on sampling position analysis. Extensive experiments on large graphs show that GraphSDH can achieve over 95% accuracy for PageRank by sampling only 10% of the edges of the original graph, and can speed up PageRank several-fold compared with the non-sampling case. Compared with random neighbor sampling, GraphSDH can reduce the mean relative error of PageRank by about 17% at a sampling neighbor ratio (sampling fraction) of 20%.
Furthermore, GraphSDH can be applied to various graph algorithms, such as Breadth-First Search (BFS), Alternating Least Squares (ALS) and Label Propagation Algorithm (LPA).","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125821806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Feasibility Study for MPI over HDFS","authors":"Wu-chun Feng, Da Zhang, Jing Zhang, Kaixi Hou, S. Pumma, Hao Wang","doi":"10.1109/HPEC43674.2020.9286250","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286250","url":null,"abstract":"With the increasing prominence of integrating high-performance computing (HPC) with big-data (BIGDATA) processing, running MPI over the Hadoop Distributed File System (HDFS) offers a promising approach for delivering better scalability and fault tolerance to traditional HPC applications. However, it comes with challenges that discourage such an approach: (1) two-sided MPI communication to support intermediate data processing, (2) a focus on enabling N-1 writes that is subject to the default HDFS block-placement policy, and (3) a pipelined writing mode in HDFS that cannot fully utilize the underlying HPC hardware. So, while directly integrating MPI with HDFS may deliver better scalability and fault tolerance to MPI applications, it will fall short of delivering competitive performance. Consequently, we present a performance study to evaluate the feasibility of integrating MPI applications to run over HDFS. Specifically, we show that by aggregating and reordering intermediate data and coordinating computation and I/O when running MPI over HDFS, we can deliver up to 1.92x and 1.78x speedup over MPI I/O and HDFS pipelined-write implementations, respectively.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124315658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Minesweeper: A Novel and Fast Ordered-Statistic CFAR Algorithm","authors":"Carl L. Colena, Michael J. Russell, Stephen A. Braun","doi":"10.1109/HPEC43674.2020.9286140","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286140","url":null,"abstract":"A novel algorithm named ‘Minesweeper’ was developed for computing the Ordered Statistic Constant False Alarm Rate (OS-CFAR) in a computationally efficient way. OS-CFAR processing chains are used in radar applications for noise-floor estimation and target discrimination. Unlike other approaches, this algorithm minimizes data reuse by using training-cell geometry and an accumulation matrix to compute the noise estimate. Computing the OS-CFAR in this manner affords some unique efficiencies, including runtime invariance with respect to the bit depth of the input data and to the training geometry. Three implementations of Minesweeper were developed and benchmarked. The optimized GPU implementation (GPU-OPT) performed best in both throughput and latency for large inputs. This algorithm has potential for use in real-time GPU-accelerated SDR applications.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122832162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LessMine: Reducing Sample Space and Data Access for Dense Pattern Mining","authors":"Tianyu Fu, Ziqian Wan, Guohao Dai, Yu Wang, Huazhong Yang","doi":"10.1109/HPEC43674.2020.9286187","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286187","url":null,"abstract":"In the era of “big data”, the graph has been proven to be one of the most important reflections of real-world problems. To refine the core properties of large-scale graphs, dense pattern mining plays a significant role. Because of the complexity of pattern mining problems, conventional implementations often lack scalability, consuming much time and memory space. Previous work (e.g., ASAP [1]) proposed approximate pattern mining as an efficient way to extract structural information from graphs, demonstrating dramatic performance improvements of up to two orders of magnitude. However, we observe three main flaws of ASAP in cases of dense patterns, so we propose LessMine, which reduces the sample space and data access for dense pattern mining. We introduce the reorganization of data structure, the method of concurrent sample, and uniform close. We also provide locality-aware partition for distributed settings. The evaluation shows that our design achieves up to 1829× speedup with a 66% lower error rate compared with ASAP.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121502948","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic Mapping and Optimization to Kokkos with Polyhedral Compilation","authors":"M. Baskaran, Charles Jin, Benoît Meister, J. Springer","doi":"10.1109/HPEC43674.2020.9286233","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286233","url":null,"abstract":"In the post-Moore's Law era, the quest for exascale computing has resulted in diverse hardware architecture trends, including novel custom and/or specialized processors to accelerate the systems, asynchronous or self-timed computing cores, and near-memory computing architectures. To contend with such heterogeneous and complex hardware targets, there have been advanced software solutions in the form of new programming models and runtimes. However, using these advanced programming models poses productivity and performance portability challenges. This work takes a significant step towards addressing the performance, productivity, and performance portability challenges faced by the high-performance computing and exascale community. We present an automatic mapping and optimization framework that takes sequential code and automatically generates high-performance parallel code in Kokkos, a performance portable parallel programming model targeted for exascale computing. We demonstrate the productivity and performance benefits of optimized mapping to Kokkos using kernels from a critical application project on climate modeling, the Energy Exascale Earth System Model (E3SM) project. 
This work thus shows that automatic generation of Kokkos code enhances the productivity of application developers and enables them to fully utilize the benefits of a programming model such as Kokkos.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131754124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPGAs in the Network and Novel Communicator Support Accelerate MPI Collectives","authors":"Pouya Haghi, Anqi Guo, Qingqing Xiong, Rushi Patel, Chen Yang, Tong Geng, Justin T. Broaddus, Ryan J. Marshall, A. Skjellum, M. Herbordt","doi":"10.1109/HPEC43674.2020.9286200","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286200","url":null,"abstract":"MPI collective operations can often be performance killers in HPC applications; we seek to solve this bottleneck by offloading them to reconfigurable hardware within the switch itself, rather than, e.g., the NIC. We have designed a hardware accelerator MPI-FPGA to implement six MPI collectives in the network. Preliminary results show that MPI-FPGA achieves on average 3.9× speedup over conventional clusters in the most likely scenarios. Essential to this work is providing support for sub-communicator collectives. We introduce a novel mechanism that enables the hardware to support a large number of communicators of arbitrary shape, and that is scalable to very large systems. We show how communicator support can be integrated easily into an in-switch hardware accelerator to implement MPI communicators and so enable full offload of MPI collectives. While this mechanism is universally applicable, we implement it in an FPGA cluster; FPGAs provide the ability to couple communication and computation and so are an ideal testbed and have a number of other architectural benefits. 
MPI-FPGA is fully integrated into MPICH and so is transparently usable by MPI applications.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128314797","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Hardware Root-of-Trust Design for Low-Power SoC Edge Devices","authors":"Alan Ehret, Eliakin Del Rosario, K. Gettings, M. Kinsy","doi":"10.1109/HPEC43674.2020.9286164","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286164","url":null,"abstract":"In this work, we introduce a hardware root-of-trust architecture for low-power edge devices. An accelerator-based SoC design that includes the hardware root-of-trust architecture is developed. An example application for the device is presented. We examine attacks based on physical access given the significant threat they pose to unattended edge systems. The hardware root-of-trust provides security features to ensure the integrity of the SoC execution environment when deployed in uncontrolled, unattended locations. E-fused boot memory ensures the boot code and other security critical software is not compromised after deployment. Digitally signed programmable instruction memory prevents execution of code from untrusted sources. A programmable finite state machine is used to enforce access policies to device resources even if the application software on the device is compromised. Access policies isolate the execution states of application and security-critical software. 
The hardware root-of-trust architecture saves energy with a lower hardware overhead than a separate secure enclave while eliminating software attack surfaces for access control policies.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114766042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"KTRussExPLORER: Exploring the Design Space of K-truss Decomposition Optimizations on GPUs","authors":"Safaa Diab, Mhd Ghaith Olabi, I. E. Hajj","doi":"10.1109/HPEC43674.2020.9286165","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286165","url":null,"abstract":"K-truss decomposition is an important method in graph analytics for finding cohesive subgraphs in a graph. Various works have accelerated k-truss decomposition on GPUs and have proposed different optimizations while doing so. The combinations of these optimizations form a large design space. However, most GPU implementations focus on a specific combination or set of combinations in this space. This paper surveys the optimizations applied to k-truss decomposition on GPUs, and presents KTRussExPLORER, a framework for exploring the design space formed by the combinations of these optimizations. Our evaluation shows that the best combination highly depends on the graph of choice, and analyses the conditions that make each optimization attractive. Some of the best combinations we find outperform previous Graph Challenge champions on many large graphs.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128583429","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A GraphBLAS solution to the SIGMOD 2014 Programming Contest using multi-source BFS","authors":"Márton Elekes, A. Nagy, Dávid Sándor, János Benjamin Antal, Tim Davis, Gábor Szárnyas","doi":"10.1109/HPEC43674.2020.9286186","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286186","url":null,"abstract":"The GraphBLAS standard defines a set of fundamental building blocks for formulating graph algorithms in the language of linear algebra. Since its first release in 2017, the expressivity of the GraphBLAS API and the performance of its implementations (such as SuiteSparse: GraphBLAS) have been studied on a number of textbook graph algorithms such as BFS, single-source shortest path, and connected components. However, less attention was devoted to other aspects of graph processing such as handling typed and attributed graphs (also known as property graphs), and making use of complex graph query techniques (handling paths, aggregation, and filtering). To study these problems in more detail, we have used GraphBLAS to solve the case study of the 2014 SIGMOD Programming Contest, which defines complex graph processing tasks that require a diverse set of operations. Our solution makes heavy use of multi-source BFS algorithms expressed as sparse matrix-matrix multiplications along with other GraphBLAS techniques such as masking and submatrix extraction. While the queries can be formulated in GraphBLAS concisely, our performance evaluation shows mixed results. 
For some queries and data sets, the performance is competitive with the hand-optimized top solutions submitted to the contest; in other cases, however, it is currently outperformed by orders of magnitude.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134201920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}