J. Trejo-Sánchez, F. Hernández-López, Miguel Ángel Uh Zapata, J. López-Martínez, Daniel Fajardo-Delgado, J. Pacheco
{"title":"Teaching High-Performance Computing in Developing Countries: A Case Study in Mexican Universities","authors":"J. Trejo-Sánchez, F. Hernández-López, Miguel Ángel Uh Zapata, J. López-Martínez, Daniel Fajardo-Delgado, J. Pacheco","doi":"10.1109/IPDPSW55747.2022.00066","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00066","url":null,"abstract":"Teaching High-Performance Computing (HPC) to undergraduate programs represents a significant challenge in most universities in developing countries like Mexico. Deficien-cies in the required infrastructure and equipment, inadequate curricula in computer engineering programs (and resistance to change them), students' lack of interest, motivation, or knowledge of this area are the main difficulties to overcome. The COVID-19 pandemic represents an additional challenge to these difficulties in teaching HPC in these programs. Despite the detriments, some strategies have been developed to incorporate the HPC concepts to Mexican students without necessarily modifying the traditional curricula. This paper presents a case study over four public universities in Mexico based on our experience as instructors. We also propose a course that introduces the HPC principles considering the heterogeneous background of the students in such universities. The results are about the number of students enrolling in related classes and participating in extra-curricular projects.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122106345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ESSA 2022 Invited Speaker: The Curious Incident of the Data in the Scientific Workflow","authors":"L. Ramakrishnan","doi":"10.1109/IPDPSW55747.2022.00181","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00181","url":null,"abstract":"The volume, veracity, and velocity of data generated by the accelerators, colliders, supercomputers, light sources and neutron sources have grown exponentially in the last decade. Data has fundamentally changed the scientific workflow running on high performance computing (HPC) systems. It is necessary that we develop appropriate capabilities and tools to understand, analyze, preserve, share, and make optimal use of data. Intertwined with data are complex human processes, policies and decisions that need to be accounted for when building software tools. In this talk, I will outline our work addressing data lifecycle challenges on HPC systems including effective use of storage hierarchy, managing complex scientific data processing, and enabling search on large-scale scientific data.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130256198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chen-Chun Chen, Kawthar Shafie Khorassani, Quentin G. Anthony, A. Shafi, H. Subramoni, D. Panda
{"title":"Highly Efficient Alltoall and Alltoallv Communication Algorithms for GPU Systems","authors":"Chen-Chun Chen, Kawthar Shafie Khorassani, Quentin G. Anthony, A. Shafi, H. Subramoni, D. Panda","doi":"10.1109/IPDPSW55747.2022.00014","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00014","url":null,"abstract":"In recent years, High Performance Computing (HPC) and Deep Learning (DL) applications have been modified to run on top supercomputers and utilize the high compute power of GPUs. While GPUs provide high computational power, communication of data between GPUs and across a network continues to be a bottleneck. In particular, with the increasing amount of FFT compute and sparse matrix transpose operations in these applications, Alltoall MPI collective operations are heavily used. Alltoall communication is considered the heaviest communication pattern compared to other MPI collective calls. Few techniques and algorithms effectively help in optimizing Alltoall communication, much less improving the performance on a dense GPU cluster while exploiting the features of modern inter-connects and topologies. Despite the introduction of Inter-Process Communication (IPC) in CUDA 4.1 by NVIDIA, state-of-the-art MPI libraries have not utilized these IPC-based mechanisms to design novel Alltoall algorithms that exploit the capabilities of modern GPUs. In this paper, we propose hybrid IPC-advanced designs for Alltoall and Alltoallv communication on novel GPU systems. By utilizing zero-copy load-store IPC mechanisms for multi-GPU communication within a node, we are able to overlap the intra-node and inter-node communication, yielding improved performance on GPU systems. We evaluate the benefits of our designs at the benchmark and application layers on the ThetaGPU system at ALCF and the Lassen system at LLNL. Our designs provide up to 13.5x and 71% improvements on 128 GPUs and 64 GPUs at the benchmark-level over state-of-the-art MPI libraries on ThetaGPU and Lassen respectively. At the application level, our designs have up to 59x performance improvement for an HPC application, heFFTe, and 5.7x performance improvement for a Deep Learning application, DeepSpeed, on 64 GPUs on ThetaGPU and 256 GPUs on Lassen.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129290473","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Benchmarking Quantum Processor Performance through Quantum Distance Metrics Over An Algorithm Suite","authors":"S. Stein, N. Wiebe, James Ang, A. Li","doi":"10.1109/IPDPSW55747.2022.00106","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00106","url":null,"abstract":"Quantum computing is poised to solve computational paradigms that classical computing could never feasibly reach. Tasks such as prime factorization to Quantum Chemistry are examples of classically difficult problems that have analogous algorithms that are sped up on quantum computers. To attain this computational advantage, we must first traverse the noisy intermediate scale quantum (NISQ) era, in which quantum processors suffer from compounding noise factors that can lead to unreliable algorithm induction producing noisy results. We describe QASMBench, a suite of QASM-level (Quantum assembly language) benchmarks that challenge all realisable angles of quantum processor noise. We evaluate a large portion of these algorithms by performing density matrix tomography on 14 IBMQ Quantum devices.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"179 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126070846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Heterogeneous GPU and FPGA computing: a VexCL case-study","authors":"Tristan Laan, A. Varbanescu","doi":"10.1109/IPDPSW55747.2022.00073","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00073","url":null,"abstract":"FPGA-based accelerators are capturing the interest of the HPC domain, primarily due to their superior energy-efficiency compared to more common accelerators, like GPUs. However, enabling HPC codes to use FPGA-based accelerators (efficiently) remains a difficult task. One interesting, fast-track solution to this problem is to extend the domain-specific, high-level languages, libraries, or APIs that already support other accelerators (e.g., GPUs) to target FPGAs. In this work we demonstrate the added value of such an approach by adding FPGA support to VexCL, a vector expression template library for OpenCL/CUDA. To this end, we use the VexCL-generated OpenCL code as intermediate representation, while creating code-skeletons to implement the FPGA code and all necessary data links between the host and accelerator. We further support five generic optimizations for the FPGA code. We demonstrate our approach on two use-cases, an affine transformation and an SpMV calculation, showcasing the performance and energy consumption of the resulting FPGA versions. We further demonstrate that the FPGA code can outperform the VexCL-generated GPU version. To illustrate the integration of GPU and FPGA code, we also demonstrate the performance of an VexCL SpMV application using a heterogeneous GPU+FPGA system. Our results indicate that, indeed, the integration of the two accelerators is seamless. Performance-wise, however, the heterogeneous version does not outperform the FPGA-only one.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126599697","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CGRA4HPC 2022 Invited Speaker: Mapping ML to the AMD/Xilinx AIE-ML architecture","authors":"Elliott Delaye","doi":"10.1109/IPDPSW55747.2022.00109","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00109","url":null,"abstract":"In the field of compute acceleration, machine learning model acceleration is one of the fastest growing areas of focus. ML model complexity in both compute and memory have driven the latest accelerator architectures and with that, developing ways to efficiently use these new architectures is the key to unlocking their potential. At AMD the AIE-ML architecture is our second generation AI-Engine architecture and we will dive into some of the ways we map the most important ML compute/bandwidth requirements to this architecture.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127695322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nathan C Frey, Dan Zhao, Simon Axelrod, Michael Jones, David Bestor, V. Gadepally, Rafael Gómez-Bombarelli, S. Samsi
{"title":"Energy-aware neural architecture selection and hyperparameter optimization","authors":"Nathan C Frey, Dan Zhao, Simon Axelrod, Michael Jones, David Bestor, V. Gadepally, Rafael Gómez-Bombarelli, S. Samsi","doi":"10.1109/IPDPSW55747.2022.00125","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00125","url":null,"abstract":"Artificial Intelligence (AI) and Deep Learning in particular have increasing computational requirements, with a corresponding increase in energy consumption. There is a tremendous opportunity to reduce the computational cost and environmental impact of deep learning by accelerating neural network architecture search and hyperparameter optimization, as well as explicitly designing neural architectures that optimize for both energy efficiency and performance. Here, we introduce a framework called training performance estimation (TPE), which builds upon existing techniques for training speed estimation in order to monitor energy consumption and rank model performance-without training models to convergence-saving up to 90% of time and energy of the full training budget. We benchmark TPE in the computationally intensive, well-studied domain of computer vision and in the emerging field of graph neural networks for machine-learned inter-atomic potentials, an important domain for scientific discovery with heavy computational demands. We propose variants of early stopping that generalize this common regularization technique to account for energy costs and study the energy costs of deploying increasingly complex, knowledge-informed architectures for AI-accelerated molecular dynamics and image classification. Our work enables immediate, significant energy savings across the entire pipeline of model development and deployment and suggests new research directions for energy-aware, knowledge-informed model architecture development.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"26 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114163100","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimal Schedules for High-Level Programming Environments on FPGAs with Constraint Programming","authors":"Pascal Jungblut, D. Kranzlmüller","doi":"10.1109/IPDPSW55747.2022.00025","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00025","url":null,"abstract":"Scheduling tasks on reconfigurable hardware is a well-known problem. Yet, the adoption of advanced scheduling strategies for reconfigurable systems is still low. We argue that a pragmatic solution not relying on low-level features like partial reconfiguration is feasible. Our theoretical framework describes reconfigurable hardware in a simple and abstract way. The constraints of a schedule are used to derive a constraint programming formulation. We present two heuristic algorithms based on list scheduling and on clustering, respectively. The model is evaluated and compared to partial reconfiguration using parameters from a previously observed LU decomposition on an FPGA. The losses are compared to a conventional, optimal approach. It can be integrated into existing technologies to aide the adoption of high-level FPGA programming environments.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122680370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Charly Castes, E. Agullo, Olivier Aumage, Emmanuelle Saillard
{"title":"Decentralized in-order execution of a sequential task-based code for shared-memory architectures","authors":"Charly Castes, E. Agullo, Olivier Aumage, Emmanuelle Saillard","doi":"10.1109/IPDPSW55747.2022.00095","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00095","url":null,"abstract":"The hardware complexity of modern machines makes the design of adequate programming models crucial for jointly ensuring performance, portability, and productivity in high-performance computing (HPC). Sequential task-based programming models paired with advanced runtime systems allow the programmer to write a sequential algorithm independently of the hardware architecture in a productive and portable manner, and let a third party software layer -the runtime system- deal with the burden of scheduling a correct, parallel execution of that algorithm to ensure performance. Many HPC algorithms have successfully been implemented following this paradigm, as a testimony of its effectiveness. Developing algorithms that specifically require fine-grained tasks along this model is still considered prohibitive, however, due to per-task management overhead [1], forcing the programmer to resort to a less abstract, and hence more complex “task+X” model. We thus investigate the possibility to offer a tailored execution model, trading dynamic mapping for efficiency by using a decentralized, conservative in-order execution of the task flow, while preserving the benefits of relying on the sequential task-based programming model. We propose a formal specification of the execution model as well as a prototype implementation, which we assess on a shared-memory multicore architecture with several synthetic workloads. The results show that under the condition of a proper task mapping supplied by the programmer, the pressure on the runtime system is significantly reduced and the execution of fine-grained task flows is much more efficient.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120898183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing Non-commutative Allreduce Over Virtualized, Migratable MPI Ranks","authors":"Sam White, L. Kalé","doi":"10.1109/IPDPSW55747.2022.00085","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00085","url":null,"abstract":"Dynamic load balancing can be difficult for MPI-based applications. Application logic and algorithms are often rewritten to enable dynamic repartitioning of the domain. An alternative approach is to virtualize the MPI ranks as threads-instead of operating system processes- and to migrate threads around the system to balance the computational load. Adaptive MPI is one such implementation. It supports virtualization of MPI ranks as migratable user-level threads. However, this migratability itself can introduce new performance overheads to applications. In this paper, we identify non-commutative reduction operations as problematic for any runtime supporting either user-defined initial mapping of ranks or dynamic migration of ranks among the cores or nodes of a machine. We investigate the challenges associated with supporting efficient non-commutative reduction operations, and explore algorithmic alternatives such as recursive doubling and halving in combination with a novel adaptive message combining technique. We explore tradeoffs in the different algorithms for various message sizes and mappings of ranks to cores, demonstrating our performance improvements using microbenchmarks.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"283 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122958668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}