{"title":"Highly Efficient Alltoall and Alltoallv Communication Algorithms for GPU Systems","authors":"Chen-Chun Chen, Kawthar Shafie Khorassani, Quentin G. Anthony, A. Shafi, H. Subramoni, D. Panda","doi":"10.1109/IPDPSW55747.2022.00014","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00014","url":null,"abstract":"In recent years, High Performance Computing (HPC) and Deep Learning (DL) applications have been modified to run on leading supercomputers and utilize the high compute power of GPUs. While GPUs provide high computational power, communication of data between GPUs and across a network continues to be a bottleneck. In particular, with the increasing amount of FFT compute and sparse matrix transpose operations in these applications, Alltoall MPI collective operations are heavily used. Alltoall communication is considered the heaviest communication pattern among MPI collective calls. Few techniques and algorithms effectively optimize Alltoall communication, much less improve its performance on dense GPU clusters while exploiting the features of modern interconnects and topologies. Despite the introduction of Inter-Process Communication (IPC) in CUDA 4.1 by NVIDIA, state-of-the-art MPI libraries have not utilized these IPC-based mechanisms to design novel Alltoall algorithms that exploit the capabilities of modern GPUs. In this paper, we propose hybrid IPC-advanced designs for Alltoall and Alltoallv communication on modern GPU systems. By utilizing zero-copy load-store IPC mechanisms for multi-GPU communication within a node, we are able to overlap the intra-node and inter-node communication, yielding improved performance on GPU systems. We evaluate the benefits of our designs at the benchmark and application layers on the ThetaGPU system at ALCF and the Lassen system at LLNL. Our designs provide up to 13.5x and 71% improvements at the benchmark level on 128 GPUs and 64 GPUs over state-of-the-art MPI libraries on ThetaGPU and Lassen, respectively. At the application level, our designs deliver up to 59x performance improvement for an HPC application, heFFTe, and 5.7x performance improvement for a Deep Learning application, DeepSpeed, on 64 GPUs on ThetaGPU and 256 GPUs on Lassen.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129290473","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CGRA4HPC 2022 Invited Speaker: Mapping ML to the AMD/Xilinx AIE-ML architecture","authors":"Elliott Delaye","doi":"10.1109/IPDPSW55747.2022.00109","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00109","url":null,"abstract":"In the field of compute acceleration, machine learning model acceleration is one of the fastest growing areas of focus. ML model complexity in both compute and memory has driven the latest accelerator architectures, and with that, developing ways to efficiently use these new architectures is the key to unlocking their potential. At AMD, the AIE-ML architecture is our second-generation AI-Engine architecture, and we will dive into some of the ways we map the most important ML compute/bandwidth requirements to this architecture.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127695322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Decentralized in-order execution of a sequential task-based code for shared-memory architectures","authors":"Charly Castes, E. Agullo, Olivier Aumage, Emmanuelle Saillard","doi":"10.1109/IPDPSW55747.2022.00095","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00095","url":null,"abstract":"The hardware complexity of modern machines makes the design of adequate programming models crucial for jointly ensuring performance, portability, and productivity in high-performance computing (HPC). Sequential task-based programming models paired with advanced runtime systems allow the programmer to write a sequential algorithm independently of the hardware architecture in a productive and portable manner, and let a third-party software layer -the runtime system- deal with the burden of scheduling a correct, parallel execution of that algorithm to ensure performance. Many HPC algorithms have successfully been implemented following this paradigm, as a testament to its effectiveness. Developing algorithms that specifically require fine-grained tasks with this model is still considered prohibitive, however, due to per-task management overhead [1], forcing the programmer to resort to a less abstract, and hence more complex, \u201ctask+X\u201d model. We thus investigate the possibility of offering a tailored execution model, trading dynamic mapping for efficiency by using a decentralized, conservative in-order execution of the task flow, while preserving the benefits of relying on the sequential task-based programming model. We propose a formal specification of the execution model as well as a prototype implementation, which we assess on a shared-memory multicore architecture with several synthetic workloads. The results show that, provided the programmer supplies a proper task mapping, the pressure on the runtime system is significantly reduced and the execution of fine-grained task flows is much more efficient.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120898183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A SHA-512 Hardware Implementation Based on Block RAM Storage Structure","authors":"Mingyuan Yang, Yemeng Zhang, Bohan Yang, Hanning Wang, S. Yin, Shaojun Wei, Leibo Liu","doi":"10.1109/IPDPSW55747.2022.00031","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00031","url":null,"abstract":"The Secure Hash Algorithms (SHAs) are essential building blocks of modern cryptographic systems. The implementation dimensions of secure hash algorithms are explored for different application scenarios. Cloud servers may favor an implementation with considerable throughput, while a compact implementation with acceptable speed and sustainable power is crucial for the Internet of Things (IoT). In this paper, we present an implementation of SHA-512 for the FPGA platform based on a Block RAM (BRAM) storage structure. Three implementation techniques are proposed to facilitate the usage of BRAMs as replacements for Look-Up Tables (LUTs) and Flip-Flops (FFs) to achieve balanced FPGA utilization. Compared to other FPGA implementations of SHA-512, our design has one of the smallest slice footprints while maintaining a moderate but sufficient throughput for cryptographic applications like the post-processing of true random number generators (TRNGs).","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121380648","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The First International Workshop on COmputing using EmeRging EXotic AI-Inspired Systems (CORtEX'22)","authors":"","doi":"10.1109/IPDPSW55747.2022.00212","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00212","url":null,"abstract":"","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121210970","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Teaching Heterogeneous Computing Using DPC++","authors":"J. Fuentes, Daniel López, Sebastián González","doi":"10.1109/IPDPSW55747.2022.00069","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00069","url":null,"abstract":"The evolution of modern computer systems from conventional processors to complex hardware units with heterogeneous accelerators is a reality. In the last decade, awareness of teaching parallel computing in undergraduate programs has increased; however, the focus has been mainly on multi-core CPUs. GPUs, FPGAs, and other accelerators are now present in most of the devices people use daily, but their programming is still left to experienced engineers. New high-level programming languages for heterogeneous architectures such as DPC++ represent a good opportunity to bring inexperienced programmers closer to accelerators. In this paper, we present a new Heterogeneous Computing course with a syllabus focused on the foundations of heterogeneous architectures (multi-core CPUs, GPUs, and FPGAs) and their programming with DPC++. We present results from the experience of teaching this course to undergraduate students. Student evaluation and assessment data show that students engaged with the course's learning activities and that there is high satisfaction with the contents covered.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116400370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Heterogeneous GPU and FPGA computing: a VexCL case-study","authors":"Tristan Laan, A. Varbanescu","doi":"10.1109/IPDPSW55747.2022.00073","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00073","url":null,"abstract":"FPGA-based accelerators are capturing the interest of the HPC domain, primarily due to their superior energy-efficiency compared to more common accelerators, like GPUs. However, enabling HPC codes to use FPGA-based accelerators (efficiently) remains a difficult task. One interesting, fast-track solution to this problem is to extend the domain-specific, high-level languages, libraries, or APIs that already support other accelerators (e.g., GPUs) to target FPGAs. In this work we demonstrate the added value of such an approach by adding FPGA support to VexCL, a vector expression template library for OpenCL/CUDA. To this end, we use the VexCL-generated OpenCL code as an intermediate representation, while creating code skeletons to implement the FPGA code and all necessary data links between the host and accelerator. We further support five generic optimizations for the FPGA code. We demonstrate our approach on two use cases, an affine transformation and an SpMV calculation, showcasing the performance and energy consumption of the resulting FPGA versions. We further demonstrate that the FPGA code can outperform the VexCL-generated GPU version. To illustrate the integration of GPU and FPGA code, we also demonstrate the performance of a VexCL SpMV application using a heterogeneous GPU+FPGA system. Our results indicate that, indeed, the integration of the two accelerators is seamless. Performance-wise, however, the heterogeneous version does not outperform the FPGA-only one.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126599697","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HiCOMB 2022 Invited Speaker: Pandemic-scale Phylogenetics","authors":"Yatish Turakhia","doi":"10.1109/IPDPSW55747.2022.00035","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00035","url":null,"abstract":"Phylogenetics has been central to the genomic surveillance, epidemiology and contact tracing efforts during the COVID-19 pandemic. But the massive scale of genomic sequencing has rendered the pre-pandemic tools quite inadequate for comprehensive phylogenetic analyses. In this talk, I will discuss a high-performance computing (HPC) phylogenetic package that we developed to address the needs imposed by this pandemic. Orders of magnitude gains were achieved by this package through several domain-specific optimization and parallelization techniques. The package comprises four programs: UShER, matOptimize, RIPPLES and matUtils. Using high-performance computing, UShER and matOptimize maintain and refine daily a massive mutation-annotated phylogenetic tree consisting of all (>9M currently) SARS-CoV-2 sequences available on online repositories. With UShER and RIPPLES, individual labs - even with modest compute resources - incorporate newly-sequenced SARS-CoV-2 genomes on this phylogeny and discover evidence for recombination in real-time. With matUtils, they rapidly query and visualize massive SARS-CoV-2 phylogenies. This has empowered scientists worldwide to study the SARS-CoV-2 evolutionary and transmission dynamics at an unprecedented scale, resolution and speed. This has laid the groundwork for future genomic surveillance of most infectious pathogens.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"207 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121454278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CORtEX 2022 Invited Speaker 3: Neuromorphic computing: from modelling the brain to bio-inspired AI","authors":"Oliver Rhodes","doi":"10.1109/IPDPSW55747.2022.00215","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00215","url":null,"abstract":"This talk will introduce the field of neuromorphic computing: researching how to build machines to explore brain function, and using our enhanced understanding of the brain to build better computer hardware and algorithms. Specifically, it will discuss spiking neural networks, including how they can be used to model neural circuits, and how these models can be harnessed to develop low-power bio-inspired AI systems.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121464036","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Arm meets Cloud: A Case Study of MPI Library Performance on AWS Arm-based HPC Cloud with Elastic Fabric Adapter","authors":"Shulei Xu, A. Shafi, H. Subramoni, D. Panda","doi":"10.1109/IPDPSW55747.2022.00083","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00083","url":null,"abstract":"Recent advances in the HPC cloud field have made multi-core, high-performance VM services more accessible. Emerging Arm-based HPC systems are also receiving more attention. Amazon Web Services recently announced new c6gn instances with a Graviton2 Arm CPU on each node and support for the Elastic Fabric Adapter (EFA), which makes AWS the leading vendor of high-performance Arm-based cloud systems. In this paper, we characterize the performance and capability of the AWS Arm architecture. We explore the performance optimization of current MPI libraries based on features of Arm-based cloud systems and the Scalable Reliable Datagram protocol of the Elastic Fabric Adapter, and we evaluate the impact of our optimization of high-performance MPI libraries. Our study shows that the performance optimization for MPI libraries on AWS Arm systems significantly improves the performance of MPI communication at both the benchmark and application levels. We gain up to 86% performance improvement in micro-benchmark-level collective communication operations and up to 9% improvement at the application level for the Weather Research and Forecasting application. This paper provides a comprehensive performance evaluation of several popular MPI libraries on AWS Arm-based cloud systems with EFA support. HPC application developers and users can draw insights from our study to achieve better performance for their applications on Arm-based cloud systems with EFA support.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117042886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}