{"title":"On the Feasibility of Using Reduced-Precision Tensor Core Operations for Graph Analytics","authors":"J. Firoz, Ang Li, Jiajia Li, K. Barker","doi":"10.1109/HPEC43674.2020.9286152","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286152","url":null,"abstract":"Today's data-driven analytics and machine learning workloads have been largely driven by General-Purpose Graphics Processing Units (GPGPUs). To accelerate dense matrix multiplications on the GPUs, Tensor Core Units (TCUs) have been introduced in recent years. In this paper, we study linear-algebra-based and vertex-centric algorithms for various graph kernels on the GPUs with the objective of applying this new hardware feature to graph applications. We identify the potential stages in these graph kernels that can be executed on the Tensor Core Units. In particular, we leverage the reformulation of the reduction and scan operations in terms of matrix multiplication [1] on the TCUs. We demonstrate that executing these operations on the TCUs, available inside different graph kernels, can assist in establishing an end-to-end pipeline on the GPGPUs without depending on hand-tuned external libraries and can still deliver comparable performance for various graph analytics.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130462157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Combinatorial Tiling for Sparse Neural Networks","authors":"Filip Pawlowski, R. Bisseling, B. Uçar, Albert-Jan N. Yzelman","doi":"10.1109/HPEC43674.2020.9286154","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286154","url":null,"abstract":"Sparse deep neural networks (DNNs) emerged from the search for networks with less storage and lower computational complexity. Sparse DNN inference is the task of using such trained DNNs to classify a batch of input data. We propose an efficient, hybrid model- and data-parallel DNN inference using hypergraph models and partitioners. We exploit tiling and weak synchronization to increase cache reuse, hide load imbalance, and hide synchronization costs. Finally, a blocking approach allows application of this new hybrid inference procedure for deep neural networks. We initially experiment using the hybrid tiled inference approach only, using the first five layers of networks from the IEEE HPEC 2019 Graph Challenge, and attain up to 2× speedup versus a data-parallel baseline.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134401835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hash Table Scalability on Intel PIUMA","authors":"B. Seshasayee, J. Fryman, I. Hur","doi":"10.1109/HPEC43674.2020.9286204","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286204","url":null,"abstract":"The Intel PIUMA (Programmable and Integrated Unified Memory Architecture) is a scalable, massively multithreaded architecture designed to operate on unstructured data, with a global address space, fine-grain memory access and various novel features for latency hiding during data movement. Hash tables are a commonly used data structure with unstructured data, hence it is imperative that the performance and scaling for hash table usages are optimized for this architecture. We study three different hash table implementations on a PIUMA simulator to show that a dual-atomics based implementation, a unique feature in PIUMA, performs competitively both at larger scales and under hash collisions. Our implementations are able to achieve strong scaling up to 16,384 hardware threads.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115769728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A High Throughput Parallel Hash Table on FPGA using XOR-based Memory","authors":"Ruizhi Zhang, Sasindu Wijeratne, Yang Yang, S. Kuppannagari, V. Prasanna","doi":"10.1109/HPEC43674.2020.9286199","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286199","url":null,"abstract":"The hash table is a fundamental data structure for quick search and retrieval of data. It is a key component in complex graph analytics and AI/ML applications. State-of-the-art parallel hash table implementations either make simplifying assumptions, such as supporting only a subset of hash table operations, or employ optimizations that lead to performance that is highly data dependent and in the worst case can be similar to a sequential implementation. In contrast, in this work we develop a dynamic hash table that supports all the hash table queries (search, insert, delete, update) while allowing us to support $p$ parallel queries ($p > 1$) per clock cycle via $p$ processing engines (PEs) in the worst case, i.e., the performance is data agnostic. We achieve this by implementing novel XOR-based multi-ported block memories on FPGAs. Additionally, we develop a technique to optimize the memory requirement of the hash table if the ratio of search to insert/update/delete queries is known beforehand. We implement our design on state-of-the-art FPGA devices. Our design is scalable to 16 PEs and supports throughput up to 5926 MOPS. It matches the throughput of the state-of-the-art hash table design, FASTHash, which only supports search and insert operations. Compared with the best FPGA design that supports the same set of operations, our hash table achieves up to 12.3× speedup.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114596578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hybrid Approach to HPC Cluster Telemetry and Hardware Log Analytics","authors":"J. Thaler, Woong Shin, S. Roberts, James H. Rogers, Todd J. Rosedahl","doi":"10.1109/HPEC43674.2020.9286239","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286239","url":null,"abstract":"The number of computer processing nodes and processor cores in cluster systems is growing rapidly. Discovering, and reacting to, a hardware or environmental issue in a timely manner enables proper fault isolation, improves quality of service, and improves system up-time. In the case of performance impacts and node outages, RAS policies can direct actions such as job quiescence or migration. Additionally, power consumption, thermal information, and utilization metrics can be used to provide cluster energy and cooling efficiency improvements as well as optimized job placement. This paper describes a highly scalable telemetry architecture that allows event aggregation and application of RAS policies, and provides the ability for cluster control system feedback. The architecture advances existing approaches by including both programmable policies, which are applied as events stream through the hierarchical network to persistent storage, and treatment of sensor telemetry in an extensible framework. This implementation has proven robust and is in use in both cloud and HPC environments including the Summit system of 4,608 nodes at Oak Ridge National Laboratory [5].","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"440 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125067032","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Homomorphic Encryption Based Secure Sensor Data Processing","authors":"V. Gadepally, Mihailo Isakov, R. Agrawal, J. Kepner, K. Gettings, M. Kinsy","doi":"10.1109/HPEC43674.2020.9286175","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286175","url":null,"abstract":"Novel sensor processing algorithms face many hurdles to their adoption. Sensor processing environments have become increasingly difficult with an ever-increasing array of threats. These threats have, in turn, raised the bar on deploying new capabilities. Many novel sensor processing algorithms exploit or induce randomness to boost algorithm performance. Co-designing this randomness with cryptographic features could be a powerful combination providing both improved algorithm performance and increased resiliency. The emerging field of signal processing in the encrypted domain has begun to explore such approaches. The development of this new class of algorithms will require new classes of tools. In particular, the foundational linear algebraic mathematics will need to be enhanced with cryptographic concepts to allow researchers to explore this new domain. This work highlights a relatively low-overhead method that uses homomorphic encryption to enhance the resiliency of a part of a larger sensor processing pipeline.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128564692","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Computing PageRank Scores of Web Crawl Data Using DGX A100 Clusters","authors":"Seunghwa Kang, Alexandre Fender, Joe Eaton, Brad Rees","doi":"10.1109/HPEC43674.2020.9286216","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286216","url":null,"abstract":"PageRank is a widely used graph analytics algorithm to rank vertices using relationship data. Large-scale PageRank is challenging due to its irregular and communication-intensive computational characteristics. We implemented PageRank on NVIDIA's newly released DGX A100 cluster and compared the performance with two recent notable large-scale PageRank computations using the Common Crawl dataset. The ShenTu framework computed PageRank scores using a large number of custom microprocessors connected with an HPC-class interconnect. The Hronos framework reported the state-of-the-art performance using 3000 commodity CPU nodes and 10 Gbps Ethernet. The Common Crawl dataset captures link relationships between web pages in a graph with 3.563 billion vertices and 128.736 billion edges. Our implementation demonstrated 13× faster PageRank iteration time than the Hronos framework using a cluster with 32 NVLink-connected A100 GPUs.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"433 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132863338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analysis of floating-point round-off error in linear algebra routines for graph clustering","authors":"L. Yang, Alyson Fox","doi":"10.1109/HPEC43674.2020.9286190","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286190","url":null,"abstract":"We explore the various ways rounding errors can impact the power method for calculating the Fiedler vector for graph clustering. A rounding error analysis reveals that the best eigenpair that is computable with a certain floating point precision type has a worst-case error that scales to its unit round-off. Although rounding errors can accumulate in the power method at the worst-case bound, this behavior is not reflected in some practical examples. Furthermore, our numerical experiments show that rounding errors from the power method may satisfy the conditions necessary for the bounding of the mis-clustering rate and that the approximate eigenvectors with errors close to half precision unit round-off can yield sufficient clustering results for partitioning stochastic block model graphs.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131342321","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Inference Benchmarking on HPC Systems","authors":"W. Brewer, G. Behm, A. Scheinine, Ben Parsons, Wesley Emeneker, Robert P. Trevino","doi":"10.1109/HPEC43674.2020.9286138","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286138","url":null,"abstract":"As deep learning on edge computing systems has become more prevalent, investigation of architectures and configurations for optimal inference performance has become a critical step for proposed artificial intelligence solutions. While there has been considerable work in the development of hardware and software for high performance inferencing, little is known about the performance of such systems on HPC architectures. In this paper, we address outstanding questions on the parallel inference performance on HPC systems. We report results and recommendations derived from evaluating iBench on multiple platforms in a variety of HPC configurations. We systematically benchmark single-GPU performance, single-node performance, and multi-node performance for maximum client-side and server-side inference throughput. In order to achieve linear speedup, we show that concurrent sending clients must be used, as opposed to sending large batch payloads parallelized across multiple GPUs. We show that client/server inferencing architectures add a considerable data transfer component, which benchmarks such as MLPerf do not measure, that needs to be taken into consideration when benchmarking HPC systems. Finally, we investigate energy efficiency of GPUs for different levels of concurrency and batch sizes to report optimal configurations that minimize cost per inference.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126794693","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast GPU Graph Contraction by Combining Efficient Shallow Searches and Post-Culling","authors":"Roozbeh Karimi, David M. Koppelman, C. J. Michael","doi":"10.1109/HPEC43674.2020.9286141","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286141","url":null,"abstract":"Efficient GPU single-source shortest-path (SSSP) queries of road network graphs can be realized by a technique called PHAST (Delling et al.) in which the graph is contracted (pre-processed using Geisberger's Contraction Hierarchies) once and the resulting contracted graph is queried as needed. PHAST accommodates GPUs' parallelism requirements well, resulting in efficient queries. For situations in which a graph is not available well in advance or changes frequently, contraction time itself becomes significant. Karimi et al. recently described a GPU contraction technique, CU-CH, which significantly reduces the contraction time of small- to medium-sized graphs, reporting a speedup of over 20× on an NVIDIA P100 GPU. However, CU-CH realizes little speedup on larger graphs, such as DIMACS’ USA and W. Europe graphs. The obstacle to faster contraction of larger graphs is the frequently performed witness path search (WPS). A WPS for a node determines which shortcut edges need to be added between the node's neighbors to maintain distances after the removal of the node. GPUs' strict thread convergence requirements and limited scratchpad preclude the bidirectional Dijkstra approach used in CPU implementations. Instead, CU-CH uses a two-hop-limit WPS tightly coded to fit GPU shared storage and to maintain thread convergence. Where two hops are sufficient, speedup is high, but for larger graphs the hop limit exacts a toll due to the accumulation of unneeded shortcuts. The problem is overcome here by retaining the efficient CU-CH WPS but using it both for its original purpose and also to identify unnecessary shortcuts added in prior steps. The unnecessary shortcuts are culled (removed). Culling shortcuts not only dramatically reduces the time needed to contract a graph but also improves the quality of the contracted graph. For smaller graphs such as DIMACS Cal (travel time), contraction time is 61% faster compared to CU-CH. For the DIMACS Europe and USA graphs, contraction times are 40× and 12× faster, respectively. SSSP query times also improve dramatically, approaching those obtained on aggressively contracted graphs. The speedup over Geisberger's CPU code is over 100 times for NVIDIA V100 GPUs on most graphs tried.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121125854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}