{"title":"Highly Efficient Alltoall and Alltoallv Communication Algorithms for GPU Systems","authors":"Chen-Chun Chen, Kawthar Shafie Khorassani, Quentin G. Anthony, A. Shafi, H. Subramoni, D. Panda","doi":"10.1109/IPDPSW55747.2022.00014","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00014","url":null,"abstract":"In recent years, High Performance Computing (HPC) and Deep Learning (DL) applications have been modified to run on leading supercomputers and utilize the high compute power of GPUs. While GPUs provide high computational power, communication of data between GPUs and across a network continues to be a bottleneck. In particular, with the increasing amount of FFT compute and sparse matrix transpose operations in these applications, Alltoall MPI collective operations are heavily used. Alltoall communication is considered the heaviest communication pattern among MPI collective calls. Few techniques and algorithms effectively optimize Alltoall communication, much less improve its performance on dense GPU clusters while exploiting the features of modern interconnects and topologies. Despite the introduction of Inter-Process Communication (IPC) in CUDA 4.1 by NVIDIA, state-of-the-art MPI libraries have not utilized these IPC-based mechanisms to design novel Alltoall algorithms that exploit the capabilities of modern GPUs. In this paper, we propose hybrid IPC-advanced designs for Alltoall and Alltoallv communication on modern GPU systems. By utilizing zero-copy load-store IPC mechanisms for multi-GPU communication within a node, we are able to overlap the intra-node and inter-node communication, yielding improved performance on GPU systems. We evaluate the benefits of our designs at the benchmark and application layers on the ThetaGPU system at ALCF and the Lassen system at LLNL. Our designs provide up to 13.5x and 71% improvements at the benchmark level on 128 GPUs and 64 GPUs over state-of-the-art MPI libraries on ThetaGPU and Lassen, respectively. At the application level, our designs deliver up to 59x performance improvement for an HPC application, heFFTe, and 5.7x performance improvement for a Deep Learning application, DeepSpeed, on 64 GPUs on ThetaGPU and 256 GPUs on Lassen.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129290473","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CGRA4HPC 2022 Invited Speaker: Mapping ML to the AMD/Xilinx AIE-ML architecture","authors":"Elliott Delaye","doi":"10.1109/IPDPSW55747.2022.00109","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00109","url":null,"abstract":"In the field of compute acceleration, machine learning model acceleration is one of the fastest growing areas of focus. ML model complexity in both compute and memory has driven the latest accelerator architectures, and with that, developing ways to efficiently use these new architectures is the key to unlocking their potential. At AMD, the AIE-ML architecture is our second-generation AI-Engine architecture, and we will dive into some of the ways we map the most important ML compute/bandwidth requirements to this architecture.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127695322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Decentralized in-order execution of a sequential task-based code for shared-memory architectures","authors":"Charly Castes, E. Agullo, Olivier Aumage, Emmanuelle Saillard","doi":"10.1109/IPDPSW55747.2022.00095","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00095","url":null,"abstract":"The hardware complexity of modern machines makes the design of adequate programming models crucial for jointly ensuring performance, portability, and productivity in high-performance computing (HPC). Sequential task-based programming models paired with advanced runtime systems allow the programmer to write a sequential algorithm independently of the hardware architecture in a productive and portable manner, and let a third-party software layer -the runtime system- deal with the burden of scheduling a correct, parallel execution of that algorithm to ensure performance. Many HPC algorithms have successfully been implemented following this paradigm, as a testament to its effectiveness. Developing algorithms that specifically require fine-grained tasks with this model is still considered prohibitive, however, due to per-task management overhead [1], forcing the programmer to resort to a less abstract, and hence more complex, \u201ctask+X\u201d model. We thus investigate the possibility of offering a tailored execution model, trading dynamic mapping for efficiency by using a decentralized, conservative in-order execution of the task flow, while preserving the benefits of relying on the sequential task-based programming model. We propose a formal specification of the execution model as well as a prototype implementation, which we assess on a shared-memory multicore architecture with several synthetic workloads. The results show that, provided the programmer supplies a proper task mapping, the pressure on the runtime system is significantly reduced and the execution of fine-grained task flows is much more efficient.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120898183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A SHA-512 Hardware Implementation Based on Block RAM Storage Structure","authors":"Mingyuan Yang, Yemeng Zhang, Bohan Yang, Hanning Wang, S. Yin, Shaojun Wei, Leibo Liu","doi":"10.1109/IPDPSW55747.2022.00031","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00031","url":null,"abstract":"The Secure Hash Algorithms (SHAs) are essential building blocks of modern cryptographic systems. The implementation dimensions of secure hash algorithms are explored for different application scenarios. Cloud servers may favor an implementation with considerable throughput, while a compact implementation with acceptable speed and sustainable power is crucial for the Internet of Things (IoT). In this paper, we present an implementation of SHA-512 for the FPGA platform based on a Block RAM (BRAM) storage structure. Three implementation techniques are proposed to facilitate the usage of BRAMs as replacements for Look-Up Tables (LUTs) and Flip-Flops (FFs) to achieve balanced FPGA utilization. Compared to other FPGA implementations of SHA-512, our design has one of the smallest slice footprints while maintaining a moderate but sufficient throughput for cryptographic applications like the post-processing of true random number generators (TRNGs).","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121380648","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The First International Workshop on COmputing using EmeRging EXotic AI-Inspired Systems (CORtEX'22)","authors":"","doi":"10.1109/IPDPSW55747.2022.00212","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00212","url":null,"abstract":"","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121210970","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Teaching Heterogeneous Computing Using DPC++","authors":"J. Fuentes, Daniel López, Sebastián González","doi":"10.1109/IPDPSW55747.2022.00069","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00069","url":null,"abstract":"The evolution of modern computer systems from conventional processors to complex hardware units with heterogeneous accelerators is a reality. In the last decade, awareness of teaching parallel computing in undergraduate programs has increased; however, the focus has been mainly on multi-core CPUs. GPUs, FPGAs, and other accelerators are now present in most of the devices people use daily, but their programming is still left to experienced engineers. New high-level programming languages for heterogeneous architectures such as DPC++ represent a good opportunity to bring inexperienced programmers closer to accelerators. In this paper, we present a new Heterogeneous Computing course with a syllabus focused on the foundations of heterogeneous architectures (multi-core CPUs, GPUs, and FPGAs) and their programming with DPC++. We present results from the experience of teaching this course to undergraduate students. Student evaluation and assessment data show that students engaged with the course's learning activities and that there is high satisfaction with the contents covered.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116400370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Heterogeneous GPU and FPGA computing: a VexCL case-study","authors":"Tristan Laan, A. Varbanescu","doi":"10.1109/IPDPSW55747.2022.00073","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00073","url":null,"abstract":"FPGA-based accelerators are capturing the interest of the HPC domain, primarily due to their superior energy-efficiency compared to more common accelerators, like GPUs. However, enabling HPC codes to use FPGA-based accelerators (efficiently) remains a difficult task. One interesting, fast-track solution to this problem is to extend the domain-specific, high-level languages, libraries, or APIs that already support other accelerators (e.g., GPUs) to target FPGAs. In this work we demonstrate the added value of such an approach by adding FPGA support to VexCL, a vector expression template library for OpenCL/CUDA. To this end, we use the VexCL-generated OpenCL code as an intermediate representation, while creating code skeletons to implement the FPGA code and all necessary data links between the host and accelerator. We further support five generic optimizations for the FPGA code. We demonstrate our approach on two use cases, an affine transformation and an SpMV calculation, showcasing the performance and energy consumption of the resulting FPGA versions. We further demonstrate that the FPGA code can outperform the VexCL-generated GPU version. To illustrate the integration of GPU and FPGA code, we also demonstrate the performance of a VexCL SpMV application using a heterogeneous GPU+FPGA system. Our results indicate that, indeed, the integration of the two accelerators is seamless. Performance-wise, however, the heterogeneous version does not outperform the FPGA-only one.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126599697","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HiCOMB 2022 Invited Speaker: Pandemic-scale Phylogenetics","authors":"Yatish Turakhia","doi":"10.1109/IPDPSW55747.2022.00035","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00035","url":null,"abstract":"Phylogenetics has been central to the genomic surveillance, epidemiology and contact tracing efforts during the COVID-19 pandemic. But the massive scale of genomic sequencing has rendered the pre-pandemic tools quite inadequate for comprehensive phylogenetic analyses. In this talk, I will discuss a high-performance computing (HPC) phylogenetic package that we developed to address the needs imposed by this pandemic. Orders of magnitude gains were achieved by this package through several domain-specific optimization and parallelization techniques. The package comprises four programs: UShER, matOptimize, RIPPLES and matUtils. Using high-performance computing, UShER and matOptimize maintain and refine daily a massive mutation-annotated phylogenetic tree consisting of all (>9M currently) SARS-CoV-2 sequences available on online repositories. With UShER and RIPPLES, individual labs - even with modest compute resources - incorporate newly-sequenced SARS-CoV-2 genomes on this phylogeny and discover evidence for recombination in real-time. With matUtils, they rapidly query and visualize massive SARS-CoV-2 phylogenies. This has empowered scientists worldwide to study the SARS-CoV-2 evolutionary and transmission dynamics at an unprecedented scale, resolution and speed. This has laid the groundwork for future genomic surveillance of most infectious pathogens.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"207 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121454278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CORtEX 2022 Invited Speaker 3: Neuromorphic computing: from modelling the brain to bio-inspired AI","authors":"Oliver Rhodes","doi":"10.1109/IPDPSW55747.2022.00215","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00215","url":null,"abstract":"This talk will introduce the field of neuromorphic computing: researching how to build machines to explore brain function, and using our enhanced understanding of the brain to build better computer hardware and algorithms. Specifically, it will discuss spiking neural networks, including how they can be used to model neural circuits, and how these models can be harnessed to develop low-power bio-inspired AI systems.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121464036","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Arm meets Cloud: A Case Study of MPI Library Performance on AWS Arm-based HPC Cloud with Elastic Fabric Adapter","authors":"Shulei Xu, A. Shafi, H. Subramoni, D. Panda","doi":"10.1109/IPDPSW55747.2022.00083","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00083","url":null,"abstract":"Recent advances in the HPC cloud field have made multi-core, high-performance VM services more accessible. Emerging Arm-based HPC systems are also receiving more attention. Amazon Web Services recently announced new c6gn instances with a Graviton2 Arm CPU on each node and support for the Elastic Fabric Adapter (EFA), which makes AWS the leading vendor of high-performance Arm-based cloud systems. In this paper, we characterize the performance and capability of the AWS Arm architecture. We explore the performance optimization of current MPI libraries based on features of Arm-based cloud systems and the Scalable Reliable Datagram protocol of the Elastic Fabric Adapter, and we evaluate the impact of our optimization of high-performance MPI libraries. Our study shows that the performance optimization for MPI libraries on AWS Arm systems significantly improves the performance of MPI communication at both the benchmark and application levels. We gain up to 86% performance improvement in micro-benchmark-level collective communication operations and up to 9% improvement at the application level for the Weather Research and Forecasting application. This paper provides a comprehensive performance evaluation of several popular MPI libraries on AWS Arm-based cloud systems with EFA support. HPC application developers and users can draw insights from our study to achieve better performance for their applications on Arm-based cloud systems with EFA support.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117042886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}