{"title":"Towards performance portability of AI graphs using SYCL","authors":"Kumudha Narasimhan, Ouadie El Farouki, M. Goli, Muhammad Tanvir, S. Georgiev, Isaac Ault","doi":"10.1109/P3HPC56579.2022.00016","DOIUrl":"https://doi.org/10.1109/P3HPC56579.2022.00016","url":null,"abstract":"The wide adoption of Deep Neural Networks (DNN) has served as an incentive to design and manufacture powerful and specialized hardware technologies, targeting systems from Edge devices to Cloud and supercomputers. While ONNX, proposed as a de facto standard for DNN model description, provides portability across various AI frameworks, supporting DNN models on various hardware architectures remains challenging. SYCL provides a C++-based portable parallel programming model to target various devices. Thus, enabling a SYCL backend for an AI framework can lead to a hardware-agnostic model for heterogeneous systems. This paper proposes a SYCL backend for ONNXRuntime as a possible solution towards the performance portability of deep learning algorithms. The proposed backend uses existing state-of-the-art SYCL-DNN and SYCL-BLAS libraries to invoke tuned SYCL kernels for DNN operations. Our performance evaluation shows that the proposed approach can achieve comparable performance with respect to state-of-the-art optimized vendor-specific libraries.","PeriodicalId":261766,"journal":{"name":"2022 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125651831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Understanding Strong Scaling on GPUs Using Empirical Performance Saturation Size","authors":"David Eberius, P. Roth, D. Rogers","doi":"10.1109/P3HPC56579.2022.00008","DOIUrl":"https://doi.org/10.1109/P3HPC56579.2022.00008","url":null,"abstract":"The roofline model provides a concise overview of the maximum performance capabilities of a given computer system through a combination of peak memory bandwidth and compute performance rates. The increasing complexity of scheduling and cache in recent GPUs, however, has introduced complicated performance variability that is not captured by arithmetic intensity alone. This work examines the effect of problem size and GPU launch configurations on roofline performance for V100, A100, MI100, and MI250X graphics processing units. We introduce an extended roofline model that takes problem size into account, and find that strong scaling on GPUs can be characterized by saturation problem sizes as additional key metrics. Saturation problem sizes break up a plot of GPU performance vs. problem size into three distinct performance regimes: size-limited, cache-bound, and DRAM-bound. With our extended roofline model, we are able to provide a robust view of these performance regimes across recent GPU architectures.","PeriodicalId":261766,"journal":{"name":"2022 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114454299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Portable and Efficient Dense Linear Algebra in the Beginning of the Exascale Era","authors":"M. Gates, A. YarKhan, D. Sukkari, Kadir Akbudak, S. Cayrols, Daniel Bielich, A. Abdelfattah, Mohammed Al Farhan, J. Dongarra","doi":"10.1109/P3HPC56579.2022.00009","DOIUrl":"https://doi.org/10.1109/P3HPC56579.2022.00009","url":null,"abstract":"The SLATE project is implementing a distributed dense linear algebra library for highly-scalable distributed-memory accelerator-based computer systems. The goal is to provide a library that can be easily ported to different hardware (CPUs, GPUs, accelerators) and will provide high performance for machines into the future. Current ports include CPUs, CUDA, ROCm, and oneAPI. We achieve both performance and portability by leveraging several layers and abstractions, including OpenMP tasks to track data dependencies, MPI for distributed communication, and the BLAS++ and LAPACK++ libraries developed as a portable layer across vendor-optimized CPU and GPU BLAS and LAPACK functionality. We rely on the C++ standard library and templating to reduce code duplication for better maintainability. The few kernels not present in BLAS are implemented in CUDA, HIP, and OpenMP target offload, and are easily ported to new platforms.","PeriodicalId":261766,"journal":{"name":"2022 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125218830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Cross-Platform Portability of Coupled-Cluster Methods with Perturbative Triples using SYCL","authors":"Abhishek Bagusetty, Ajay Panyala, Gavin Brown, Jack Kirk","doi":"10.1109/P3HPC56579.2022.00013","DOIUrl":"https://doi.org/10.1109/P3HPC56579.2022.00013","url":null,"abstract":"Tensor contractions form the fundamental computational operation of computational chemistry, and these contractions dictate the performance of widely used coupled-cluster (CC) methods in computational chemistry. In this work, we study a single-source, cross-platform C++ abstraction layer programming model, SYCL, for computational chemistry methods such as the CCSD(T) coupled-cluster formalism. An existing optimized CUDA implementation was migrated to SYCL to make use of a novel algorithm that provides tractable GPU memory needs for solving high-dimensional tensor contractions for accelerating CCSD(T). We present the cross-platform performance achieved using SYCL implementations for the non-iterative triples contribution of the CCSD(T) formalism, which is considered the performance bottleneck, on NVIDIA A100 and AMD Instinct MI250X. Additionally, we also draw comparisons of similar performance metrics from vendor-based native programming models such as CUDA and ROCm HIP. Our results indicate that the performance of SYCL measured at scale was on par with the code written in HIP for AMD MI250X GPUs, while the performance is slightly lacking on NVIDIA A100 GPUs in comparison to CUDA.","PeriodicalId":261766,"journal":{"name":"2022 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC)","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133588871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance Portability of Sparse Block Diagonal Matrix Multiple Vector Multiplications on GPUs","authors":"K. Ibrahim, Chao Yang, Pieter Maris","doi":"10.1109/P3HPC56579.2022.00011","DOIUrl":"https://doi.org/10.1109/P3HPC56579.2022.00011","url":null,"abstract":"The emergence of accelerator-based computer architectures and programming models makes it challenging to achieve performance portability for large-scale scientific simulation software. In this paper, we focus on a sparse block diagonal matrix multiple vector (SpMM) computational kernel and discuss techniques that can be used to achieve performance portability on NVIDIA and AMD based accelerators using CUDA, HIP, OpenACC, and Kokkos. We show that performance portability can vary significantly across programming models, GPU architectures, and problem settings, by up to 52× in the explored problems. Our study revisits performance portability aggregation techniques to guide the development and the selection of performance-portable algorithmic variants.","PeriodicalId":261766,"journal":{"name":"2022 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114750582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Heterogeneous Programming for the Homogeneous Majority","authors":"Tom Deakin, J. Cownie, Wei-Chen Lin, Simon McIntosh-Smith","doi":"10.1109/P3HPC56579.2022.00006","DOIUrl":"https://doi.org/10.1109/P3HPC56579.2022.00006","url":null,"abstract":"In order to take advantage of the burgeoning diversity in processors at the frontier of supercomputing, the HPC community is migrating and improving codes to utilise heterogeneous nodes, where accelerators, principally GPUs, are highly prevalent in top-tier supercomputer designs. Programs therefore need to embrace at least some of the complexities of heterogeneous architectures. Parallel programming models have evolved to express heterogeneous paradigms whilst providing mechanisms for writing portable, performant programs. History shows that technologies first introduced at the frontier percolate down to local workhorse systems. However, we expect there will always be a mix of systems, some heterogeneous, but some remaining as homogeneous CPU systems. Thus it is important to ensure codes adapted for heterogeneous systems continue to run efficiently on CPUs. In this study, we explore how well widely used heterogeneous programming models perform on CPU-only platforms, and survey the performance portability they offer on the latest CPU architectures.","PeriodicalId":261766,"journal":{"name":"2022 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122548284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance portable Vlasov code with C++ parallel algorithm","authors":"Y. Asahi, T. Padioleau, G. Latu, Julien Bigot, V. Grandgirard, K. Obrejan","doi":"10.1109/P3HPC56579.2022.00012","DOIUrl":"https://doi.org/10.1109/P3HPC56579.2022.00012","url":null,"abstract":"This paper presents the performance portable implementation of a kinetic plasma simulation code with C++ parallel algorithms to run across multiple CPUs and GPUs. Relying on the language standard parallelism stdpar and the proposed language standard multi-dimensional array support mdspan, we demonstrate that a performance portable implementation is possible without harming readability and productivity. We obtain good overall performance for a mini-application, within 20% of the Kokkos version on Intel Icelake, NVIDIA V100, and A100 GPUs. Our conclusion is that stdpar can be a good candidate to develop a performance portable and productive code targeting Exascale-era platforms, assuming this approach will be available on AMD and/or Intel GPUs in the future.","PeriodicalId":261766,"journal":{"name":"2022 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115112655","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Piper: Pipelining OpenMP Offloading Execution Through Compiler Optimization For Performance","authors":"K. Parasyris, G. Georgakoudis, J. Doerfert, I. Laguna, T. Scogland","doi":"10.1109/P3HPC56579.2022.00015","DOIUrl":"https://doi.org/10.1109/P3HPC56579.2022.00015","url":null,"abstract":"OpenMP offload reduces the development complexity of HPC GPU codes and provides portability. A source of poor performance is the lockstep execution of data transfers and computation. Overlapping these operations can provide significant performance gains. However, the developer must manually slice data transfers and kernel execution, and efficiently schedule these operations for execution, which is a hard and error-prone task. We propose Piper, an automatic mechanism for OpenMP offload to perform overlapping. Piper statically analyzes offload kernels and associates computations with memory locations. The extended runtime system exploits this analysis information, divides a kernel into independent sub-tasks, and schedules them for pipelined execution for overlapping. At any point in time, Piper also controls the coarseness and number of sub-tasks executed. By doing so, Piper allows the execution of kernels whose memory requirements exceed the GPU device memory. Piper speeds up execution by up to 2.67× compared to OpenMP offload execution.","PeriodicalId":261766,"journal":{"name":"2022 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128136129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Leveraging Compiler-Based Translation to Evaluate a Diversity of Exascale Platforms","authors":"Jacob Lambert, Mohammad Alaul Haque Monil, Seyong Lee, A. Malony, J. Vetter","doi":"10.1109/P3HPC56579.2022.00007","DOIUrl":"https://doi.org/10.1109/P3HPC56579.2022.00007","url":null,"abstract":"Accelerator-based heterogeneous computing is the de facto standard in current and upcoming exascale machines. These heterogeneous resources empower computational scientists to select a machine or platform well-suited to their domain or applications. However, this diversity of machines also poses challenges related to programming model selection: inconsistent availability of programming models across different exascale systems, lack of performance portability for those programming models that do span several systems, and inconsistent performance between different models on a single platform. We explore these challenges on exascale-similar hardware, including AMD MI100 and NVIDIA A100 GPUs. By extending the source-to-source compiler OpenARC, we demonstrate the power of automated translation of applications written in a single frontend programming model (OpenACC) into a variety of backend models (OpenMP, OpenCL, CUDA, HIP) that span the upcoming exascale environments. This translation enables us to compare performance within and across devices and to analyze programming model behavior with profiling tools.","PeriodicalId":261766,"journal":{"name":"2022 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115460626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"From Task-Based GPU Work Aggregation to Stellar Mergers: Turning Fine-Grained CPU Tasks into Portable GPU Kernels","authors":"Gregor Daiß, Patrick Diehl, Dominic C. Marcello, Alireza Kheirkhahan, H. Kaiser, D. Pflüger","doi":"10.1109/P3HPC56579.2022.00014","DOIUrl":"https://doi.org/10.1109/P3HPC56579.2022.00014","url":null,"abstract":"Meeting both scalability and performance portability requirements is a challenge for any HPC application, especially for adaptively refined ones. In Octo-Tiger, an astrophysics application for the simulation of stellar mergers, we approach this with existing solutions: We employ HPX to obtain fine-grained tasks to easily distribute work and finely overlap communication and computation. For the computations themselves, we use Kokkos to turn these tasks into compute kernels capable of running on hardware ranging from a few CPU cores to powerful accelerators. There is a missing link, however: while the fine-grained parallelism exposed by HPX is useful for scalability, it can hinder GPU performance when the tasks become too small to saturate the device, causing low resource utilization. To bridge this gap, we investigate multiple different GPU work aggregation strategies within Octo-Tiger, adding one new strategy, and evaluate the node-level performance impact on recent AMD and NVIDIA GPUs, achieving noticeable speedups.","PeriodicalId":261766,"journal":{"name":"2022 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129929286","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}