Implementing the Broadcast Operation in a Distributed Task-based Runtime
Rodrigo Ceccato, H. Yviquel, M. Pereira, Alan Souza, G. Araújo
2022 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW). DOI: 10.1109/SBAC-PADW56527.2022.00014

Abstract: Scientific applications that require high performance rely on multi-node, multi-core systems equipped with accelerators. Code for these heterogeneous architectures often mixes different programming paradigms and is hard to read and maintain. Task-based distributed runtimes can improve portability and readability by allowing programmers to write tasks that are automatically scheduled and offloaded for execution. Nevertheless, in large systems, communication can dominate execution time. To mitigate this, such systems usually implement collective operation algorithms that efficiently execute common data-movement patterns across a group of processes. This work studies the use of different broadcast strategies in the OpenMP Cluster (OMPC) task-based runtime. In addition to OMPC's default behavior of on-demand data delivery, we introduce a routine that automatically detects data movement equivalent to a broadcast in the task graph and actively sends the data through a specialized algorithm. Our largest test, broadcasting 64 GB of data across 64 worker nodes on the Santos Dumont cluster with an extended version of Task Bench, showed a 2.02x speedup with the Dynamic Broadcast algorithm and a 2.49x speedup with the MPI broadcast routine, compared to the default on-demand delivery.
An OpenMP-only Linear Algebra Library for Distributed Architectures
Carla Cardoso, H. Yviquel, G. Valarini, Gustavo Leite, Rodrigo Ceccato, M. Pereira, Alan Souza, G. Araújo
2022 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW). DOI: 10.1109/SBAC-PADW56527.2022.00013

Abstract: This paper presents OMPC PLASMA, a dense linear algebra library for distributed-memory systems. It leverages the OpenMP Cluster (OMPC) programming model to execute the PLASMA library with task parallelism on a distributed cluster architecture. The OpenMP Cluster model is used to define task regions that are then distributed across the cluster nodes by the OMPC runtime, which automatically manages task scheduling, inter-node communication, and fault tolerance. OMPC PLASMA modifies several PLASMA functions to distribute the matrix across the nodes and perform the computation using each node's threads. Experimental results show that OMPC PLASMA achieves speedups of 4.00x with 4 worker nodes, 7.00x with 8, and 12.00x with 16 over the original single-node implementation. OMPC PLASMA also achieves a 3.00x speedup over ScaLAPACK with 4 worker nodes and a matrix size of 90k×90k.
I/O performance of multiscale finite element simulations on HPC environments
F. Boito, A. T. Gomes, Louis Peyrondet, Luan Teylo
2022 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW). DOI: 10.1109/SBAC-PADW56527.2022.00012

Abstract: In this paper, we present MSLIO, a code to mimic the I/O behavior of multiscale simulations. Such an I/O kernel is useful for HPC research, as it can be executed more easily and more efficiently than the full simulations when researchers are interested in the I/O load only. We validate MSLIO by comparing it to the I/O performance of an actual simulation, and we then use it to test some possible improvements to the output routine of the MHM (Multiscale Hybrid Mixed) library.