{"title":"Designing Effective Sparse Expert Models","authors":"Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, J. Dean, Noam M. Shazeer, W. Fedus","doi":"10.1109/IPDPSW55747.2022.00171","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00171","url":null,"abstract":"Scale has opened new frontiers in natural language processing -- but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy efficient path to even larger and more capable language models. But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning. Our work focuses on these issues and acts as a design guide. We conclude by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts or ST-MoE-32B). For the first time, a sparse model achieves state-of-the-art performance in transfer learning, across a diverse set of tasks including reasoning (SuperGLUE, ARC Easy, ARC Challenge), summarization (XSum, CNN-DM), closed book question answering (WebQA, Natural Questions), and adversarially constructed tasks (Winogrande, ANLI R3).","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"228 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117273673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimal Triangulation on the High Bandwidth Memory Model","authors":"K. Nakano, V. Poupet","doi":"10.1109/IPDPSW55747.2022.00089","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00089","url":null,"abstract":"The High Bandwidth Memory (HBM) model is a theoretical computing model consisting of a logic circuit with a large external memory. Each address of the external memory can store $p$ elements which can be read or written at the same time. Access to $p$ elements stored at a given address in the external memory has a latency of $l$ clock cycles. However, access to any $k$ consecutive addresses can be done in only $(k+l-1)$ clock cycles in a pipeline fashion by burst mode. A hardware algorithm is implemented in a logic circuit of the HBM to solve a particular problem. In this paper, we present an optimal implementation of the $O(n^{3})$-time dynamic programming algorithm for solving the optimal polygon triangulation (OPT) problem, which is the problem of finding a triangulation with minimum total weight of an input convex $n$-gon with weighted chords. We assume that the input weight matrix of a convex $n$-gon is stored in the external memory of the HBM model. Our hardware algorithm implemented in the logic circuit of size $O(s^{2})$ operates on it and computes the optimal polygon triangulation of the input polygon in $O(\\frac{n^{3}}{sp}+\\frac{n^{3}}{s^{2}}+\\frac{n^{3}}{s^{3}}l)$ time. We also provide a theoretical proof showing that any hardware algorithm in a logic circuit of size $O(s^{2})$ takes at least $\\Omega(\\frac{n^{3}}{sp}+\\frac{n^{3}}{s^{2}})$ time to solve the OPT problem. Thus, our implementation is optimal whenever $s^{2}\\geq lp$ or $s\\geq l$, and this optimality condition is always satisfied from a practical point of view.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115015687","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Architecture-Independent CGRA Compiler enabling OpenMP Applications","authors":"Takuya Kojima, B. Adhi, Carlos Cortes, Y. Tan, K. Sano","doi":"10.1109/IPDPSW55747.2022.00112","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00112","url":null,"abstract":"Coarse-grained reconfigurable architecture (CGRA) is a promising platform for HPC systems in the post-Moore era. A single-source programming model is essential for practical heterogeneous computing. However, we do not have a canonical programming model and a frontend compiler for it. Existing CGRAs vary widely in their execution model, computational capability, and system structure, which magnifies the difficulty of orchestrating the compiler techniques. It consequently forces designers of the CGRAs to develop the compiler from scratch, working only for their architectures. Such an approach is outdated, given other successful accelerators like GPUs and FPGAs. This paper presents a new CGRA compiler framework in order to reduce the development effort of CGRA applications. OpenMP-annotated codes are fed into the proposed compiler, as recent OpenMP versions support device offloading to accelerators. This property improves the reusability of the existing source code for HPC workloads. The design of the compiler is inspired by LLVM, which is the most famous compiler framework, so that the frontend is built to be architecture-independent. In this work, we demonstrate that the proposed compiler can handle different types of CGRAs without changing the source codes. In addition, we discuss the effect of architecture-independent optimization algorithms. We also provide an open-source implementation of the compiler framework at https://github.com/hal-lab-u-tokyo/CGRAOmp.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"257 8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115456396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Keynote Talk 1: Efficient DNN Training at Scale: from Algorithms to Hardware","authors":"Gennady Pekhimenko","doi":"10.1109/IPDPSW55747.2022.00219","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00219","url":null,"abstract":"The recent popularity of deep neural networks (DNNs) has generated a lot of research interest in performing DNN-related computation efficiently. However, the primary focus of systems research is usually quite narrow and limited to inference (i.e., how to efficiently execute already trained models) and image classification networks as the primary benchmark for evaluation. In this talk, we will demonstrate a holistic approach to DNN training acceleration and scalability starting from the algorithm, to software and hardware optimizations, to special development and optimization tools.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124743556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing Non-commutative Allreduce Over Virtualized, Migratable MPI Ranks","authors":"Sam White, L. Kalé","doi":"10.1109/IPDPSW55747.2022.00085","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00085","url":null,"abstract":"Dynamic load balancing can be difficult for MPI-based applications. Application logic and algorithms are often rewritten to enable dynamic repartitioning of the domain. An alternative approach is to virtualize the MPI ranks as threads, instead of operating system processes, and to migrate threads around the system to balance the computational load. Adaptive MPI is one such implementation. It supports virtualization of MPI ranks as migratable user-level threads. However, this migratability itself can introduce new performance overheads to applications. In this paper, we identify non-commutative reduction operations as problematic for any runtime supporting either user-defined initial mapping of ranks or dynamic migration of ranks among the cores or nodes of a machine. We investigate the challenges associated with supporting efficient non-commutative reduction operations, and explore algorithmic alternatives such as recursive doubling and halving in combination with a novel adaptive message combining technique. We explore tradeoffs in the different algorithms for various message sizes and mappings of ranks to cores, demonstrating our performance improvements using microbenchmarks.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"283 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122958668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Teaching High-Performance Computing in Developing Countries: A Case Study in Mexican Universities","authors":"J. Trejo-Sánchez, F. Hernández-López, Miguel Ángel Uh Zapata, J. López-Martínez, Daniel Fajardo-Delgado, J. Pacheco","doi":"10.1109/IPDPSW55747.2022.00066","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00066","url":null,"abstract":"Teaching High-Performance Computing (HPC) in undergraduate programs represents a significant challenge in most universities in developing countries like Mexico. Deficiencies in the required infrastructure and equipment, inadequate curricula in computer engineering programs (and resistance to changing them), and students' lack of interest, motivation, or knowledge of this area are the main difficulties to overcome. The COVID-19 pandemic adds a further challenge to teaching HPC in these programs. Despite these obstacles, some strategies have been developed to introduce HPC concepts to Mexican students without necessarily modifying the traditional curricula. This paper presents a case study over four public universities in Mexico based on our experience as instructors. We also propose a course that introduces the HPC principles considering the heterogeneous background of the students in such universities. The results concern the number of students enrolling in related classes and participating in extra-curricular projects.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122106345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimal Schedules for High-Level Programming Environments on FPGAs with Constraint Programming","authors":"Pascal Jungblut, D. Kranzlmüller","doi":"10.1109/IPDPSW55747.2022.00025","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00025","url":null,"abstract":"Scheduling tasks on reconfigurable hardware is a well-known problem. Yet, the adoption of advanced scheduling strategies for reconfigurable systems is still low. We argue that a pragmatic solution not relying on low-level features like partial reconfiguration is feasible. Our theoretical framework describes reconfigurable hardware in a simple and abstract way. The constraints of a schedule are used to derive a constraint programming formulation. We present two heuristic algorithms based on list scheduling and on clustering, respectively. The model is evaluated and compared to partial reconfiguration using parameters from a previously observed LU decomposition on an FPGA. The losses are compared to a conventional, optimal approach. It can be integrated into existing technologies to aid the adoption of high-level FPGA programming environments.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122680370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Combined Application of Approximate Computing Techniques in DNN Hardware Accelerators","authors":"Enrico Russo, M. Palesi, Davide Patti, Habiba Lahdhiri, Salvatore Monteleone, G. Ascia, V. Catania","doi":"10.1109/IPDPSW55747.2022.00013","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00013","url":null,"abstract":"This paper applies Approximate Computing (AC) techniques to the main elements which form a DNN hardware accelerator, namely, computation, communication, and memory subsystems. Specifically, approximate multipliers for computation, link voltage swing reduction for communication, voltage over-scaling for the internal SRAM memory, and lossy compression of the external DRAM memory are considered. The different AC techniques are applied in isolation as well as in conjunction with each other. A set of representative CNN models are mapped onto the approximated hardware accelerators and the trade-offs performance vs. energy vs. accuracy are derived for the execution of CNN inferences.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122895617","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Benchmarking Quantum Processor Performance through Quantum Distance Metrics Over An Algorithm Suite","authors":"S. Stein, N. Wiebe, James Ang, A. Li","doi":"10.1109/IPDPSW55747.2022.00106","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00106","url":null,"abstract":"Quantum computing is poised to solve computational problems that are beyond the feasible reach of classical computing. Tasks ranging from prime factorization to quantum chemistry are examples of classically difficult problems that have analogous algorithms that are sped up on quantum computers. To attain this computational advantage, we must first traverse the noisy intermediate scale quantum (NISQ) era, in which quantum processors suffer from compounding noise factors that can lead to unreliable algorithm induction producing noisy results. We describe QASMBench, a suite of QASM-level (quantum assembly language) benchmarks that challenge all realisable angles of quantum processor noise. We evaluate a large portion of these algorithms by performing density matrix tomography on 14 IBMQ Quantum devices.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"179 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126070846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ESSA 2022 Invited Speaker: The Curious Incident of the Data in the Scientific Workflow","authors":"L. Ramakrishnan","doi":"10.1109/IPDPSW55747.2022.00181","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00181","url":null,"abstract":"The volume, veracity, and velocity of data generated by the accelerators, colliders, supercomputers, light sources and neutron sources have grown exponentially in the last decade. Data has fundamentally changed the scientific workflow running on high performance computing (HPC) systems. It is necessary that we develop appropriate capabilities and tools to understand, analyze, preserve, share, and make optimal use of data. Intertwined with data are complex human processes, policies and decisions that need to be accounted for when building software tools. In this talk, I will outline our work addressing data lifecycle challenges on HPC systems including effective use of storage hierarchy, managing complex scientific data processing, and enabling search on large-scale scientific data.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130256198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}