{"title":"On the Characterization of the Performance-Productivity Gap for FPGA","authors":"Atharva Gondhalekar, Thomas Twomey, W. Feng","doi":"10.1109/HPEC55821.2022.9926404","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926404","url":null,"abstract":"Today, FPGA vendors provide a C++/C-based programming environment to enhance programmer productivity over using a hardware-description language at the register-transfer level. The common perception is that this enhanced pro-ductivity comes at the expense of significantly less performance, e.g., as much an order of magnitude worse. To characterize this performance-productivity tradeoff, we propose a new composite metric, II, that quantitatively captures the perceived discrepancy between the performance and productivity of any two given FPGA programming languages, e.g., Verilog vs. OpenCL. We then present the implications of our metric via a case study on the design of a Sobel filter (i.e., edge detector) using three different programming models - Verilog, OpenCL, oneAPI - on an Intel Arria 10 GX FPGA accelerator. Relative to performance, our results show that an optimized OpenCL kernel achieves 84% of the performance of an optimized Verilog version of the code on a 7680×4320 (8K) image. Conversely, relative to productivity, OpenCL offers a 6.1 x improvement in productivity over Verilog, while oneAPI improves the productivity by an additional factor of 1.25 x over OpenCL.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130060481","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Explicit Ordering Refinement for Accelerating Irregular Graph Analysis","authors":"Michael Mandulak, Ruochen Hu, George M. Slota","doi":"10.1109/HPEC55821.2022.9926340","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926340","url":null,"abstract":"Vertex reordering for efficient memory access in extreme-scale graph-based data analysis shows considerable improvement to the cache efficiency and runtimes of widely used graph analysis algorithms. Despite this, modern efficient ordering methods are often heuristic-based and do not directly optimize some given metrics. Thus, this paper conducts an experimental study into explicit metric-based vertex ordering optimization. We introduce a universal graph partitioning-inspired approach focusing on CPU shared-memory parallelism to the vertex ordering problem through the explicit refinement of low-degree vertices using the Linear Gap Arrangement and Log Gap Arrangement problems as comprehensive metrics for ordering improvement. This degree-based refinement method is evaluated upon a number of initial orderings with timing and cache efficiency results relative to three shared-memory graph analytic algorithms: PageRank, Louvain and the Multistep algorithm. Applying refinement, we observe runtime improvements of up to 15x on the ClueWeb09 graph and up to 4x improvements to cache efficiency on a variety of network types and initial orderings, demonstrating the feasibility of an optimization approach to the vertex ordering problem at a large scale.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"73 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114092450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analyzing Multi-trillion Edge Graphs on Large GPU Clusters: A Case Study with PageRank","authors":"Seunghwa Kang, Joseph Nke, Brad Rees","doi":"10.1109/HPEC55821.2022.9926341","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926341","url":null,"abstract":"We previously reported PageRank performance results on a cluster with 32 A100 GPUs [7]. This paper extends the previous work to 2048 GPUs. The previous implementation performs well as long as the number of G PU s is small relative to the square of the average vertex degree but its scalability deteriorates as the number of GPUs further increases. We updated our previous implementation with the following objectives: 1) enable analyzing a P times larger graph with P times more GPUs up to P = 2048, 2) achieve reasonably good weak scaling, and 3) integrate the improvements to the open-source data science ecosystem (i.e. RAPIDS cuGraph, https://github.com/rapidsai/cugraph). While we evaluate the updates with PageRank in this paper, they improve the scalability of a broader set of algorithms in cuGraph. To be more specific, we updated our 2D edge partitioning scheme; implemented the PDCSC (partially doubly compressed sparse column) format which is a hybrid data structure that combines CSC (compressed sparse column) and DCSC (doubly compressed sparse column); adopted (key, value) pairs to store edge source vertex property values; and improved the reduction communication strategy. The 32 GPU cluster has A100 GPUs (40 GB HBM per GPU) connected with NVLink. We ran the updated implementation on the Selene supercomputer which uses InfiniBand for inter-node communication and NVLink for intra-node communication. Each Selene node has eight A100 GPUs (80 GB HBM per GPU). Analyzing the web crawl graph (3.563 billion vertices and 128.7 billion edges, 32 bit vertex ID, unweighted, average vertex degree: 36.12) took 0.187 second per Page Rank iteration on the 32 GPU cluster. Computing Page Rank scores of a scale 38 R-mat graph (274.9 billion vertices and 4.398 trillion edges, 64 bit vertex ID, 32 bit edge weight, average vertex degree: 16) took 1.54 second per Page Rank iteration on the Selene supercomputer with 2048 GPUs. We conclude this paper discussing potential network system enhancements to improve the scaling.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121991086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Online Detection and Classification of State Transitions of Multivariate Shock and Vibration Data","authors":"Nicklaus Przybylski, William M. Jones, Nathan Debardeleben","doi":"10.1109/HPEC55821.2022.9926361","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926361","url":null,"abstract":"The US Department of Energy's (DOE) Los Alamos National Laboratory (LANL) is interested in automatic anomaly detection and classification applied to highly instrumented flight shock and vibration data for the purpose of providing insight into operational safety. For example, the safe and secure transport of materials and devices during a variety of conditions is particularly of interest. In this work, we apply well-known Machine Learning (ML) techniques to a publicly available motor vibration data set that serves as a proxy to the actual LANL data. We successfully train a random forest to classify anomalous motor states using the signal data set, and use this model to simulate real-time anomaly detection and event classification on multi-variate time series data [1], [2]. Furthermore, we perform an extensive suite of computational studies on a large cluster computer to determine optimal parametric settings for our framework and evaluate the cost-benefit of these parameters.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"237 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129296737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Im2win: Memory Efficient Convolution On SIMD Architectures","authors":"Shuai-bing Lu, Jun Chu, X. Liu","doi":"10.1109/HPEC55821.2022.9926408","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926408","url":null,"abstract":"Convolution is the most expensive operation among neural network operations, thus its performance is critical to the overall performance of neural networks. Commonly used convolution approaches, including general matrix multiplication (GEMM)-based convolution and direct convolution, rely on im2col for data transformation or do not use data transformation at all, respectively. However, the im2col data transformation can lead to at least 2 x memory footprint compared to not using data transformation at all, thus limiting the size of neural network models running on memory-limited systems. Meanwhile, not using data transformation usually performs poorly due to nonconsecutive memory access although it consumes less memory. To solve those problems, we propose a new memory-efficient data transformation algorithm, called im2win. This algorithm refactorizes a row of square or rectangle dot product windows of the input image and flattens unique elements within these windows into a row in the output tensor, which enables consecutive memory access and data reuse, and thus greatly reduces the memory overhead. Furthermore, we propose a high-performance im2win-based convolution algorithm with various optimizations, including vectorization, loop reordering, etc. Our experimental results show that our algorithm reduces the memory overhead by average to 41.6% compared to the PyTorch's convolution implementation based on im2col, and achieves average to 3.6 × and 5.3× speedup in performance compared to the im2col-based convolution and not using data transformation, respectively.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117236517","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Achieving Speedups for Distributed Graph Biconnectivity","authors":"Ian Bogle, George M. Slota","doi":"10.1109/HPEC55821.2022.9926360","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926360","url":null,"abstract":"As data scales continue to increase, studying the porting and implementation of shared memory parallel algorithms for distributed memory architectures becomes increasingly important. We consider the problem of biconnectivity for this current study, which identifies cut vertices and cut edges in a graph. As part of our study, we implemented and optimized a shared memory biconnectivity algorithm based on color propagation within a distributed memory context. This algorithm is neither work nor time efficient. However, when we compare to distributed implementations of theoretically efficient algorithms, we find that simple non-optimal algorithms can greatly outperform time-efficient algorithms in practice when implemented for real distributed-memory environments and real data. Overall, our distributed implementation for computing graph biconnectivity demonstrates an average strong scaling speedup of 15 x across 64 MPI ranks on a suite of irregular real-world inputs. We also note an average of llx and 7.3x speedup relative to the optimal serial algorithm and fastest shared-memory implementation for the biconnectivity problem, respectively.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124363160","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating Sparse Deep Neural Network Inference Using GPU Tensor Cores","authors":"Yufei Sun, Long Zheng, Qinggang Wang, Xiangyu Ye, Yu Huang, Pengcheng Yao, Xiaofei Liao, Hai Jin","doi":"10.1109/HPEC55821.2022.9926300","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926300","url":null,"abstract":"Sparse deep neural networks (SpDNN) attract a lot of research and industry attention because of their powerful learning capability, whose execution time is dominated by the sparse matrix-dense matrix multiplication (SpMM). As one of specialized processors for matrix multiplication, NVIDIA GPU Tensor Cores can perform half-precision matrix-matrix multiplication with higher performance than CUDA Cores, which provides great op-portunities for SpMM acceleration. However, performing SpMM efficiently on Tensor Cores remains tremendously challenging. First, typical Tensor Cores do not handle extremely sparse matrix computations well, delivering much lower performance compared to the dense counterparts. Second, the single-precision Challenge dataset prevents them from leveraging powerful Tensor Cores to improve performance. To this end, we first propose a similarity-based matrix transformation scheme, which polarizes the weight matrix to be either denser or sparser in local regions. Then the denser and sparser workloads are respectively processed on Tensor Cores and CUDA Cores, boosting the overall efficiency. Second, considering the half-precision limitation of Tensor Cores, we further propose a lightweight emulation algorithm to achieve the single-precision computation on Tensor Cores without affecting the correctness of final results. To the best of our knowl-edge, this paper is the first to accelerate SpDNN inference on Tensor Cores without compromising the precision requirement. Extensive experiments validate that our work reaches up to 300 TeraEdges per second inference throughput on a single A100 GPU, yielding up to 89.41x and 8.12x speedups against the champions of the 2020 and 2021 Sparse Deep Neural Network Graph Challenge, respectively. Moreover, our 4-GPU version are also up to 6.56 x faster over the 2021 champion running on 4 GPUs and 7.55x faster over the 2020 champion running on 768 GPUs.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121059590","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HashTag: Fast Lookup in a Persistent Memory File System","authors":"Matthew Curtis-Maury, Yash Trivedi","doi":"10.1109/HPEC55821.2022.9926368","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926368","url":null,"abstract":"Persistent Memory (PM) offers byte-addressability and persistence on the memory bus, and delivers dramatic performance improvements over traditional storage media. While many file systems have been optimized for PM, a large fraction of processing time is generally spent locating the required data in PM due to the standard use of extent-trees for location indexing. This paper presents HashTag, a cache of PM locations for use in PM file systems with support for snapshot creation. We evaluate HashTag across a range of configurations to determine the impact of various location caching options on filesystem performance. These lessons can inform the design of future caching solutions in PM filesystems.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123586279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Scalable Inference Pipeline for 3D Axon Tracing Algorithms","authors":"Benjamin Fenelon, L. Gjesteby, Webster Guan, Juhyuk Park, Kwanghun Chung, L. Brattain","doi":"10.1109/HPEC55821.2022.9926403","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926403","url":null,"abstract":"High inference times of machine learning-based axon tracing algorithms pose a significant challenge to the practical analysis and interpretation of large-scale brain imagery. This paper explores a distributed data pipeline that employs a SLURM-based job array to run multiple machine learning algorithm predictions simultaneously. Image volumes were split into N (1–16) equal chunks that are each handled by a unique compute node and stitched back together into a single 3D prediction. Preliminary results comparing the inference speed of 1 versus 16 node job arrays demonstrated a 90.95% decrease in compute time for 32 GB input volume and 88.41% for 4 GB input volume. The general pipeline may serve as a baseline for future improved implementations on larger input volumes which can be tuned to various application domains.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126634290","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Surrogate ML/AI Model Benchmarking for FAIR Principles' Conformance","authors":"P. Luszczek, Cade Brown","doi":"10.1109/HPEC55821.2022.9926401","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926401","url":null,"abstract":"We present benchmarking platform for surrogate ML/AI models that enables the essential properties for open science and allow them to be findable, accessible, interoperable, and reusable. We also present a use case of cloud cover modeling, analysis, and experimental testing based on a large dataset of multi-spectral satellite sensor data. We use this particular evaluation to highlight the plethora of choices that need resolution for the life cycle of supporting the scientific workflows with data-driven models that need to be first trained to satisfactory accuracy and later monitored during field usage for proper feedback into both computational results and future data model improvements. Unlike traditional testing, performance, or analysis efforts, we focus exclusively on science-oriented metrics as the relevant figures of merit.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"93 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115539058","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}