{"title":"Designing Effective Sparse Expert Models","authors":"Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, J. Dean, Noam M. Shazeer, W. Fedus","doi":"10.1109/IPDPSW55747.2022.00171","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00171","url":null,"abstract":"Scale has opened new frontiers in natural language processing -- but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy efficient path to even larger and more capable language models. But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning. Our work focuses on these issues and acts as a design guide. We conclude by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts or ST-MoE-32B). For the first time, a sparse model achieves state-of-the-art performance in transfer learning, across a diverse set of tasks including reasoning (SuperGLUE, ARC Easy, ARC Challenge), summarization (XSum, CNN-DM), closed book question answering (WebQA, Natural Questions), and adversarially constructed tasks (Winogrande, ANLI R3).","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"228 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117273673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Arm meets Cloud: A Case Study of MPI Library Performance on AWS Arm-based HPC Cloud with Elastic Fabric Adapter","authors":"Shulei Xu, A. Shafi, H. Subramoni, D. Panda","doi":"10.1109/IPDPSW55747.2022.00083","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00083","url":null,"abstract":"Recent advances in the HPC cloud field have made multi-core high-performance VM services more accessible. Emerging Arm-based HPC systems are also receiving more attention. Amazon Web Services recently announced new c6gn instances with a Graviton2 Arm CPU on each node and support for Elastic Fabric Adapter, which makes it the leading high-performance Arm-based cloud system vendor. In this paper, we characterize the performance and capability of the AWS Arm architecture. We explore the performance optimization of current MPI libraries based on features of Arm-based cloud systems and the Scalable Reliable Datagram protocol of Elastic Fabric Adapter, and evaluate the impact of our optimization of high-performance MPI libraries. Our study shows that performance optimization of MPI libraries on AWS Arm systems significantly improves the performance of MPI communication at both the benchmark and application levels. We gain up to 86% performance improvement in micro-benchmark-level collective communication operations and up to 9% improvement at the application level with the Weather Research and Forecasting application. This paper provides a comprehensive performance evaluation of several popular MPI libraries on AWS Arm-based cloud systems with EFA support. HPC application developers and users can draw on our study to achieve better performance of their applications on Arm-based cloud systems with EFA support.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117042886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RAW 2022 Keynote Speaker 1: Using FPGAs in datacenters and the cloud","authors":"G. Alonso","doi":"10.1109/IPDPSW55747.2022.00020","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00020","url":null,"abstract":"Several trends in the IT industry are driving an increasing specialization of the hardware layers. On the one hand, demanding workloads, large data volumes, diversity in data types, etc. are all factors contributing to making general-purpose computing too inefficient. On the other hand, cloud computing and its economies of scale allow vendors to invest in specialized hardware for particular tasks that otherwise would be too expensive or consume resources needed elsewhere. In this talk I will discuss the shift towards hardware acceleration and show with several examples from industry and from research the large role that FPGAs could play.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115643788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Teaching Heterogeneous Computing Using DPC++","authors":"J. Fuentes, Daniel López, Sebastián González","doi":"10.1109/IPDPSW55747.2022.00069","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00069","url":null,"abstract":"The evolution of modern computer systems from conventional processors to complex hardware units with heterogeneous accelerators is a reality. In the last decade, awareness of the need to teach parallel computing in undergraduate programs has increased; however, the focus has been mainly on multi-core CPUs. GPUs, FPGAs, and other accelerators are now present in most of the devices people use daily, but their programming is still left to experienced engineers. New high-level programming languages for heterogeneous architectures such as DPC++ represent a good opportunity to bring inexperienced programmers closer to accelerators. In this paper, we present a new Heterogeneous Computing course with a syllabus focused on the foundations of heterogeneous architectures (multi-core CPUs, GPUs, and FPGAs) and their programming with DPC++. We present results from the experience of teaching this course to undergraduate students. Student evaluation and assessment data show that students engaged with the course's learning activities and that there is high satisfaction with the contents covered.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116400370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Architecture-Independent CGRA Compiler enabling OpenMP Applications","authors":"Takuya Kojima, B. Adhi, Carlos Cortes, Y. Tan, K. Sano","doi":"10.1109/IPDPSW55747.2022.00112","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00112","url":null,"abstract":"Coarse-grained reconfigurable architecture (CGRA) is a promising platform for HPC systems in the post-Moore era. A single-source programming model is essential for practical heterogeneous computing. However, we do not have a canonical programming model and a frontend compiler for it. Existing CGRAs vary widely in their execution model, computational capability, and system structure, which magnifies the difficulty of orchestrating compiler techniques. This consequently forces designers of CGRAs to develop a compiler from scratch that works only for their architectures. Such an approach is outdated, given other successful accelerators like GPUs and FPGAs. This paper presents a new CGRA compiler framework in order to reduce the development effort of CGRA applications. OpenMP-annotated codes are fed into the proposed compiler, as recent OpenMP versions support device offloading to accelerators. This property improves the reusability of existing source code for HPC workloads. The design of the compiler is inspired by LLVM, the most famous compiler framework, so that the frontend is built to be architecture-independent. In this work, we demonstrate that the proposed compiler can handle different types of CGRAs without changing the source code. In addition, we discuss the effect of architecture-independent optimization algorithms. We also provide an open-source implementation of the compiler framework at https://github.com/hal-lab-u-tokyo/CGRAOmp.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"257 8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115456396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimal Triangulation on the High Bandwidth Memory Model","authors":"K. Nakano, V. Poupet","doi":"10.1109/IPDPSW55747.2022.00089","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00089","url":null,"abstract":"The High Bandwidth Memory (HBM) model is a theoretical computing model consisting of a logic circuit with a large external memory. Each address of the external memory can store $p$ elements, which can be read or written at the same time. Access to the $p$ elements stored at a given address in the external memory has a latency of $l$ clock cycles. However, access to any $k$ consecutive addresses can be done in only $(k+l-1)$ clock cycles in a pipelined fashion using burst mode. A hardware algorithm is implemented in the logic circuit of the HBM to solve a particular problem. In this paper, we present an optimal implementation of the $O(n^{3})$-time dynamic programming algorithm for solving the optimal polygon triangulation (OPT) problem, which is the problem of finding a triangulation with minimum total weight of an input convex $n$-gon with weighted chords. We assume that the input weight matrix of a convex $n$-gon is stored in the external memory of the HBM model. Our hardware algorithm, implemented in a logic circuit of size $O(s^{2})$, operates on it and computes the optimal polygon triangulation of the input polygon in $O(\\frac{n^{3}}{sp}+\\frac{n^{3}}{s^{2}}+\\frac{n^{3}}{s^{3}}l)$ time. We also provide a theoretical proof showing that any hardware algorithm in a logic circuit of size $O(s^{2})$ takes at least $\\Omega(\\frac{n^{3}}{sp}+\\frac{n^{3}}{s^{2}})$ time to solve the OPT problem. Thus, our implementation is optimal whenever $s^{2}\\geq lp$ or $s\\geq l$, and this optimality condition is always satisfied from a practical point of view.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115015687","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The First International Workshop on COmputing using EmeRging EXotic AI-Inspired Systems (CORtEX'22)","authors":"","doi":"10.1109/IPDPSW55747.2022.00212","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00212","url":null,"abstract":"","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121210970","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CORtEX 2022 Invited Speaker 3: Neuromorphic computing: from modelling the brain to bio-inspired AI","authors":"Oliver Rhodes","doi":"10.1109/IPDPSW55747.2022.00215","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00215","url":null,"abstract":"This talk will introduce the field of neuromorphic computing: researching how to build machines to explore brain function; and using our enhanced understanding of the brain to build better computer hardware and algorithms. Specifically it will discuss spiking neural networks, including how they can be used to model neural circuits, and how these models can be harnessed to develop low-power bio-inspired AI systems.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121464036","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A SHA-512 Hardware Implementation Based on Block RAM Storage Structure","authors":"Mingyuan Yang, Yemeng Zhang, Bohan Yang, Hanning Wang, S. Yin, Shaojun Wei, Leibo Liu","doi":"10.1109/IPDPSW55747.2022.00031","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00031","url":null,"abstract":"The Secure Hash Algorithms (SHAs) are essential building blocks of modern cryptographic systems. The implementation dimensions of secure hash algorithms are explored for different application scenarios. Cloud servers may favor an implementation with considerable throughput, while a compact implementation with acceptable speed and sustainable power is crucial for the Internet of Things (IoT). In this paper, we present an implementation of SHA-512 for FPGA platforms based on a Block RAM (BRAM) storage structure. Three implementation techniques are proposed to facilitate the usage of BRAMs as replacements for Look-Up Tables (LUTs) and Flip-Flops (FFs) to achieve a balanced FPGA utilization. Compared to other FPGA implementations of SHA-512, our design has one of the smallest slice consumptions while maintaining a moderate but sufficient throughput for cryptographic applications like the post-processing of true random number generators (TRNGs).","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121380648","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HiCOMB 2022 Invited Speaker: Pandemic-scale Phylogenetics","authors":"Yatish Turakhia","doi":"10.1109/IPDPSW55747.2022.00035","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00035","url":null,"abstract":"Phylogenetics has been central to the genomic surveillance, epidemiology and contact tracing efforts during the COVID-19 pandemic. But the massive scale of genomic sequencing has rendered the pre-pandemic tools quite inadequate for comprehensive phylogenetic analyses. In this talk, I will discuss a high-performance computing (HPC) phylogenetic package that we developed to address the needs imposed by this pandemic. Orders of magnitude gains were achieved by this package through several domain-specific optimization and parallelization techniques. The package comprises four programs: UShER, matOptimize, RIPPLES and matUtils. Using high-performance computing, UShER and matOptimize maintain and refine daily a massive mutation-annotated phylogenetic tree consisting of all (>9M currently) SARS-CoV-2 sequences available on online repositories. With UShER and RIPPLES, individual labs - even with modest compute resources - incorporate newly-sequenced SARS-CoV-2 genomes on this phylogeny and discover evidence for recombination in real-time. With matUtils, they rapidly query and visualize massive SARS-CoV-2 phylogenies. This has empowered scientists worldwide to study the SARS-CoV-2 evolutionary and transmission dynamics at an unprecedented scale, resolution and speed. This has laid the groundwork for future genomic surveillance of most infectious pathogens.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"207 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121454278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}