{"title":"Towards performance portability of AI graphs using SYCL","authors":"Kumudha Narasimhan, Ouadie El Farouki, M. Goli, Muhammad Tanvir, S. Georgiev, Isaac Ault","doi":"10.1109/P3HPC56579.2022.00016","DOIUrl":"https://doi.org/10.1109/P3HPC56579.2022.00016","url":null,"abstract":"The wide adoption of Deep Neural Networks (DNN) has served as an incentive to design and manufacture powerful and specialized hardware technologies, targeting systems from Edge devices to Cloud and supercomputers. While ONNX, proposed as a de facto standard for DNN model description, provides portability across various AI frameworks, supporting DNN models on various hardware architectures remains challenging. SYCL provides a C++-based portable parallel programming model to target various devices. Thus, enabling a SYCL backend for an AI framework can lead to a hardware-agnostic model for heterogeneous systems. This paper proposes a SYCL backend for ONNXRuntime as a possible solution towards the performance portability of deep learning algorithms. The proposed backend uses existing state-of-the-art SYCL-DNN and SYCL-BLAS libraries to invoke tuned SYCL kernels for DNN operations. Our performance evaluation shows that the proposed approach can achieve comparable performance with respect to state-of-the-art optimized vendor-specific libraries.","PeriodicalId":261766,"journal":{"name":"2022 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125651831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Understanding Strong Scaling on GPUs Using Empirical Performance Saturation Size","authors":"David Eberius, P. Roth, D. Rogers","doi":"10.1109/P3HPC56579.2022.00008","DOIUrl":"https://doi.org/10.1109/P3HPC56579.2022.00008","url":null,"abstract":"The roofline model provides a concise overview of the maximum performance capabilities of a given computer system through a combination of peak memory bandwidth and compute performance rates. The increasing complexity of scheduling and cache in recent GPUs, however, has introduced complicated performance variability that is not captured by arithmetic intensity alone. This work examines the effect of problem size and GPU launch configurations on roofline performance for V100, A100, MI100, and MI250X graphics processing units. We introduce an extended roofline model that takes problem size into account, and find that strong scaling on GPUs can be characterized by saturation problem sizes as additional key metrics. Saturation problem sizes break up a plot of GPU performance vs. problem size into three distinct performance regimes: size-limited, cache-bound, and DRAM-bound. With our extended roofline model, we are able to provide a robust view of these performance regimes across recent GPU architectures.","PeriodicalId":261766,"journal":{"name":"2022 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114454299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Portable and Efficient Dense Linear Algebra in the Beginning of the Exascale Era","authors":"M. Gates, A. YarKhan, D. Sukkari, Kadir Akbudak, S. Cayrols, Daniel Bielich, A. Abdelfattah, Mohammed Al Farhan, J. Dongarra","doi":"10.1109/P3HPC56579.2022.00009","DOIUrl":"https://doi.org/10.1109/P3HPC56579.2022.00009","url":null,"abstract":"The SLATE project is implementing a distributed dense linear algebra library for highly-scalable distributed-memory accelerator-based computer systems. The goal is to provide a library that can be easily ported to different hardware (CPUs, GPUs, accelerators) and will provide high performance for machines into the future. Current ports include CPUs, CUDA, ROCm, and oneAPI. We achieve both performance and portability by leveraging several layers and abstractions, including OpenMP tasks to track data dependencies, MPI for distributed communication, and the BLAS++ and LAPACK++ libraries developed as a portable layer across vendor-optimized CPU and GPU BLAS and LAPACK functionality. We rely on the C++ standard library and templating to reduce code duplication for better maintainability. The few kernels not present in BLAS are implemented in CUDA, HIP, and OpenMP target offload, and are easily ported to new platforms.","PeriodicalId":261766,"journal":{"name":"2022 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125218830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Cross-Platform Portability of Coupled-Cluster Methods with Perturbative Triples using SYCL","authors":"Abhishek Bagusetty, Ajay Panyala, Gavin Brown, Jack Kirk","doi":"10.1109/P3HPC56579.2022.00013","DOIUrl":"https://doi.org/10.1109/P3HPC56579.2022.00013","url":null,"abstract":"Tensor contractions form the fundamental computational operation of computational chemistry, and these contractions dictate the performance of widely used coupled-cluster (CC) methods in computational chemistry. In this work, we study a single-source, cross-platform C++ abstraction layer programming model, SYCL, for computational chemistry methods such as the CCSD(T) coupled-cluster formalism. An existing optimized CUDA implementation was migrated to SYCL to make use of a novel algorithm that provides tractable GPU memory needs for solving high-dimensional tensor contractions for accelerating CCSD(T). We present the cross-platform performance achieved using SYCL implementations for the non-iterative triples contribution of the CCSD(T) formalism, which is considered the performance bottleneck, on NVIDIA A100 and AMD Instinct MI250X. Additionally, we also draw comparisons of similar performance metrics from vendor-based native programming models such as CUDA and ROCm HIP. Our results indicate that the performance of SYCL measured at scale was on par with the code written in HIP for AMD MI250X GPUs, while the performance is slightly lacking on NVIDIA A100 GPUs in comparison to CUDA.","PeriodicalId":261766,"journal":{"name":"2022 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC)","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133588871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance Portability of Sparse Block Diagonal Matrix Multiple Vector Multiplications on GPUs","authors":"K. Ibrahim, Chao Yang, Pieter Maris","doi":"10.1109/P3HPC56579.2022.00011","DOIUrl":"https://doi.org/10.1109/P3HPC56579.2022.00011","url":null,"abstract":"The emergence of accelerator-based computer architectures and programming models makes it challenging to achieve performance portability for large-scale scientific simulation software. In this paper, we focus on a sparse block diagonal matrix multiple vector (SpMM) computational kernel and discuss techniques that can be used to achieve performance portability on NVIDIA and AMD based accelerators using CUDA, HIP, OpenACC, and Kokkos. We show that performance portability can vary significantly across programming models, GPU architectures, and problem settings, by up to 52× in the explored problems. Our study revisits performance portability aggregation techniques to guide the development and the selection of performance-portable algorithmic variants.","PeriodicalId":261766,"journal":{"name":"2022 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114750582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Heterogeneous Programming for the Homogeneous Majority","authors":"Tom Deakin, J. Cownie, Wei-Chen Lin, Simon McIntosh-Smith","doi":"10.1109/P3HPC56579.2022.00006","DOIUrl":"https://doi.org/10.1109/P3HPC56579.2022.00006","url":null,"abstract":"In order to take advantage of the burgeoning diversity in processors at the frontier of supercomputing, the HPC community is migrating and improving codes to utilise heterogeneous nodes, where accelerators, principally GPUs, are highly prevalent in top-tier supercomputer designs. Programs therefore need to embrace at least some of the complexities of heterogeneous architectures. Parallel programming models have evolved to express heterogeneous paradigms whilst providing mechanisms for writing portable, performant programs. History shows that technologies first introduced at the frontier percolate down to local workhorse systems. However, we expect there will always be a mix of systems, some heterogeneous, but some remaining as homogeneous CPU systems. Thus it is important to ensure codes adapted for heterogeneous systems continue to run efficiently on CPUs. In this study, we explore how well widely used heterogeneous programming models perform on CPU-only platforms, and survey the performance portability they offer on the latest CPU architectures.","PeriodicalId":261766,"journal":{"name":"2022 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122548284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance portable Vlasov code with C++ parallel algorithm","authors":"Y. Asahi, T. Padioleau, G. Latu, Julien Bigot, V. Grandgirard, K. Obrejan","doi":"10.1109/P3HPC56579.2022.00012","DOIUrl":"https://doi.org/10.1109/P3HPC56579.2022.00012","url":null,"abstract":"This paper presents the performance portable implementation of a kinetic plasma simulation code with C++ parallel algorithms to run across multiple CPUs and GPUs. Relying on the language standard parallelism stdpar and the proposed language standard multi-dimensional array support mdspan, we demonstrate that a performance portable implementation is possible without harming readability and productivity. We obtain good overall performance for a mini-application, within 20% of the Kokkos version on Intel Icelake, NVIDIA V100, and A100 GPUs. Our conclusion is that stdpar can be a good candidate to develop a performance portable and productive code targeting Exascale-era platforms, assuming this approach will be available on AMD and/or Intel GPUs in the future.","PeriodicalId":261766,"journal":{"name":"2022 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115112655","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Piper: Pipelining OpenMP Offloading Execution Through Compiler Optimization For Performance","authors":"K. Parasyris, G. Georgakoudis, J. Doerfert, I. Laguna, T. Scogland","doi":"10.1109/P3HPC56579.2022.00015","DOIUrl":"https://doi.org/10.1109/P3HPC56579.2022.00015","url":null,"abstract":"OpenMP offload reduces the development complexity of HPC GPU codes and provides portability. A source of poor performance is the lockstep execution of data transfers and computation. Overlapping these operations can provide significant performance gains. However, the developer must manually slice data transfers and kernel execution, and efficiently schedule these operations for execution, which is a hard and error-prone task. We propose Piper, an automatic mechanism for OpenMP offload to perform overlapping. Piper statically analyzes offload kernels and associates computations with memory locations. The extended runtime system exploits this analysis information, divides a kernel into independent sub-tasks, and schedules them for pipelined execution for overlapping. At any point in time, Piper also controls the coarseness and number of sub-tasks executed. By doing so, Piper allows the execution of kernels whose memory requirements exceed the GPU device memory. Piper speeds up execution by up to 2.67× compared to OpenMP offload execution.","PeriodicalId":261766,"journal":{"name":"2022 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128136129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Leveraging Compiler-Based Translation to Evaluate a Diversity of Exascale Platforms","authors":"Jacob Lambert, Mohammad Alaul Haque Monil, Seyong Lee, A. Malony, J. Vetter","doi":"10.1109/P3HPC56579.2022.00007","DOIUrl":"https://doi.org/10.1109/P3HPC56579.2022.00007","url":null,"abstract":"Accelerator-based heterogeneous computing is the de facto standard in current and upcoming exascale machines. These heterogeneous resources empower computational scientists to select a machine or platform well-suited to their domain or applications. However, this diversity of machines also poses challenges related to programming model selection: inconsistent availability of programming models across different exascale systems, lack of performance portability for those programming models that do span several systems, and inconsistent performance between different models on a single platform. We explore these challenges on exascale-similar hardware, including AMD MI100 and NVIDIA A100 GPUs. By extending the source-to-source compiler OpenARC, we demonstrate the power of automated translation of applications written in a single frontend programming model (OpenACC) into a variety of backend models (OpenMP, OpenCL, CUDA, HIP) that span the upcoming exascale environments. This translation enables us to compare performance within and across devices and to analyze programming model behavior with profiling tools.","PeriodicalId":261766,"journal":{"name":"2022 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115460626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"From Task-Based GPU Work Aggregation to Stellar Mergers: Turning Fine-Grained CPU Tasks into Portable GPU Kernels","authors":"Gregor Daiß, Patrick Diehl, Dominic C. Marcello, Alireza Kheirkhahan, H. Kaiser, D. Pflüger","doi":"10.1109/P3HPC56579.2022.00014","DOIUrl":"https://doi.org/10.1109/P3HPC56579.2022.00014","url":null,"abstract":"Meeting both scalability and performance portability requirements is a challenge for any HPC application, especially for adaptively refined ones. In Octo-Tiger, an astrophysics application for the simulation of stellar mergers, we approach this with existing solutions: We employ HPX to obtain fine-grained tasks to easily distribute work and finely overlap communication and computation. For the computations themselves, we use Kokkos to turn these tasks into compute kernels capable of running on hardware ranging from a few CPU cores to powerful accelerators. There is a missing link, however: while the fine-grained parallelism exposed by HPX is useful for scalability, it can hinder GPU performance when the tasks become too small to saturate the device, causing low resource utilization. To bridge this gap, we investigate multiple different GPU work aggregation strategies within Octo-Tiger, adding one new strategy, and evaluate the node-level performance impact on recent AMD and NVIDIA GPUs, achieving noticeable speedups.","PeriodicalId":261766,"journal":{"name":"2022 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129929286","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}