Proceedings of the 25th European MPI Users' Group Meeting: Latest Publications

Efficient Asynchronous Communication Progress for MPI without Dedicated Resources
Pub Date: 2018-09-23 | DOI: 10.1145/3236367.3236376
Amit Ruhela, H. Subramoni, S. Chakraborty, Mohammadreza Bayatpour, Pouya Kousha, D. Panda
{"title":"Efficient Asynchronous Communication Progress for MPI without Dedicated Resources","authors":"Amit Ruhela, H. Subramoni, S. Chakraborty, Mohammadreza Bayatpour, Pouya Kousha, D. Panda","doi":"10.1145/3236367.3236376","DOIUrl":"https://doi.org/10.1145/3236367.3236376","url":null,"abstract":"The overlap of computation and communication is critical for good performance of many HPC applications. State-of-the-art designs for the asynchronous progress require specially designed hardware resources (advanced switches or network interface cards), dedicated processor cores or application modification (e.g. use of MPI_Test). These techniques suffer from various issues like increasing code complexity/cost and loss of available compute resources for end applications. In this paper, we take up this challenge and propose a simple yet effective technique to achieve good overlap without needing any additional hardware or software resources. The proposed thread-based design allows MPI libraries to self-detect when asynchronous communication progress is needed and minimizes the number of context-switches and preemption between the main thread and the asynchronous progress thread. We evaluate the proposed design against state-of-the-art designs in other MPI libraries including MVAPICH2, Intel MPI, and Open MPI. We demonstrate benefits of the proposed approach at microbenchmark and at application level at scale on four different architectures including Intel Broadwell, Intel Xeon Phi (KNL), IBM OpenPOWER, and Intel Skylake with InfiniBand and Omni-Path interconnects. Compared to other state-of-the-art designs, our proposed approach shows upto 46%, 37%, and 49% improvement for All-to-one, One-to-all, and All-to-all communication patterns respectively collectives on 1,024 processes. We also show 38% performance improvement for SPEC MPI compute-intensive applications on 384 processes and 44% performance improvement with the P3DFFT application on 448 processes.","PeriodicalId":225539,"journal":{"name":"Proceedings of the 25th European MPI Users' Group Meeting","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126298228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 15
Cellular automata beyond 100k cores: MPI vs Fortran coarrays
Pub Date: 2018-09-23 | DOI: 10.1145/3236367.3236384
A. Shterenlikht, L. Cebamanos
{"title":"Cellular automata beyond 100k cores: MPI vs Fortran coarrays","authors":"A. Shterenlikht, L. Cebamanos","doi":"10.1145/3236367.3236384","DOIUrl":"https://doi.org/10.1145/3236367.3236384","url":null,"abstract":"Fortran coarrays are an attractive alternative to MPI due to a familiar Fortran syntax, single sided communications and implementation in the compiler. Scaling of coarrays is compared in this work to MPI, using cellular automata (CA) 3D Ising magnetisation miniapps, built with the CASUP CA library, https://cgpack.sourceforge.io, developed by the authors. Ising energy and magnetisation were calculated with MPI_ALLREDUCE and Fortran 2018 co_sum collectives. The work was done on ARCHER (Cray XC30) up to the full machine capacity: 109,056 cores. Ping-pong latency and bandwidth results are very similar with MPI and with coarrays for message sizes from 1B to several MB. MPI halo exchange (HX) scaled better than coarray HX, which is surprising because both algorithms use pair-wise communications: MPI IRECV/ISEND/WAITALL vs Fortran sync images. Adding OpenMP to MPI or to coarrays resulted in worse L2 cache hit ratio, and lower performance in all cases, even though the NUMA effects were ruled out. This is likely because the CA algorithm is memory and network bound. The sampling and tracing analysis shows good load balancing in compute in all miniapps, but imbalance in communication, indicating that the difference in performance between MPI and coarrays is likely due to parallel libraries (MPICH2 vs libpgas) and the Cray hardware specific libraries (uGNI vs DMAPP). Overall, the results look promising for coarray use beyond 100k cores. However, further coarray optimisation is needed to narrow the performance gap between coarrays and MPI.","PeriodicalId":225539,"journal":{"name":"Proceedings of the 25th European MPI Users' Group Meeting","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127886268","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Performance model for mesh optimization on distributed-memory computers
Pub Date: 2018-09-23 | DOI: 10.1145/3236367.3236372
Domingo Benítez, J. M. Escobar, R. Montenegro, E. Rodríguez
{"title":"Performance model for mesh optimization on distributed-memory computers","authors":"Domingo Benítez, J. M. Escobar, R. Montenegro, E. Rodríguez","doi":"10.1145/3236367.3236372","DOIUrl":"https://doi.org/10.1145/3236367.3236372","url":null,"abstract":"Many mesh optimization applications are based on vertex repositioning algorithms (VrPA). The execution times of these numerical algorithms vary widely, usually with a trade-off between different parameters. In this work, we analyze the impacts of six parameters of sequential VrPA on runtime. Our analysis is used to propose a new workload measure called number of mesh element evaluations. Since the execution time required for VrPA programs may be too large and there is concurrency in processing mesh elements, parallelism has been used to improve performance efficiently. The performance model is extended to parallel VrPA algorithms that are implemented in MPI. This model has been validated using two Open MPI versions on two distributed-memory computers and is the basis for the quantitative analysis of performance scalability, load balancing and synchronization and communication overheads. Finally, a new approach to mesh partitioning that improves load balancing is proposed.","PeriodicalId":225539,"journal":{"name":"Proceedings of the 25th European MPI Users' Group Meeting","volume":"123 6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133356307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Improving the Interoperability between MPI and Task-Based Programming Models
Pub Date: 2018-09-23 | DOI: 10.1145/3236367.3236382
Kevin Sala, Jorge Bellón, Pau Farré, Xavier Teruel, Josep M. Pérez, Antonio J. Peña, Daniel J. Holmes, Vicencc Beltran, Jesús Labarta
{"title":"Improving the Interoperability between MPI and Task-Based Programming Models","authors":"Kevin Sala, Jorge Bellón, Pau Farré, Xavier Teruel, Josep M. Pérez, Antonio J. Peña, Daniel J. Holmes, Vicencc Beltran, Jesús Labarta","doi":"10.1145/3236367.3236382","DOIUrl":"https://doi.org/10.1145/3236367.3236382","url":null,"abstract":"In this paper we propose an API to pause and resume task execution depending on external events. We leverage this generic API to improve the interoperability between MPI synchronous communication primitives and tasks. When an MPI operation blocks, the task running is paused so that the runtime system can schedule a new task on the core that became idle. Once the MPI operation is completed, the paused task is put again on the runtime system's ready queue. We expose our proposal through a new MPI threading level which we implement through two approaches. The first approach is an MPI wrapper library that works with any MPI implementation by intercepting MPI synchronous calls, implementing them on top of their asynchronous counterparts. In this case, the task-based runtime system is also extended to periodically check for pending MPI operations and resume the corresponding tasks once MPI operations complete. The second approach consists in directly modifying the MPICH runtime system, a well-known implementation of MPI, to directly call the pause/resume API when a synchronous MPI operation blocks and completes, respectively. Our experiments reveal that this proposal not only simplifies the development of hybrid MPI+OpenMP applications that naturally overlap computation and communication phases; it also improves application performance and scalability by removing artificial dependencies across communication tasks.","PeriodicalId":225539,"journal":{"name":"Proceedings of the 25th European MPI Users' Group Meeting","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131148675","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 19
MPI Derived Datatypes: Performance and Portability Issues
Pub Date: 2018-09-23 | DOI: 10.1145/3236367.3236378
Qingqing Xiong, P. Bangalore, A. Skjellum, M. Herbordt
{"title":"MPI Derived Datatypes: Performance and Portability Issues","authors":"Qingqing Xiong, P. Bangalore, A. Skjellum, M. Herbordt","doi":"10.1145/3236367.3236378","DOIUrl":"https://doi.org/10.1145/3236367.3236378","url":null,"abstract":"This paper addresses performance-portability and overall performance issues when derived datatypes are used with four MPI implementations: Open MPI, MPICH, MVAPICH2, and Intel MPI. These comparisons are particularly relevant today since most vendor implementations are now based on Open MPI or MPICH rather than on vendor proprietary code as was more prevalent in the past. Our findings are that, within a single MPI implementation, there are significant differences in performance as a function of it reasonable encodings of derived datatypes as supported by the MPI standard. While this finding may not be surprising, it is important to understand how fundamental vs. arbitrary choices made in early implementation impact the use of derived datatypes to date. A more significant finding is that one cannot reliably choose a single derived datatype format and expect uniform performance portability among these four implementations. That is, the best-performing path under one of the MPI code bases is not the same as the best-performing path under another. Users have to be prepared to recode for a different formulation to move efficiently among MPICH, MVAPICH2, Intel MPI, and Open MPI. This lack of uniformity presents a significant gap in MPI's fundamental purpose of offering performance portability. Specific examination of internal implementation details indicates why performance is different among the implementations. Proposed solutions to this problem include i) revamping datatypes; ii) providing a common, underlying datatype standard used by multiple MPI implementations; and iii) exploring new ways to describe derived datatypes that are optimizable by modern networks and faster than MPI implementations' software-based marshaling and unmarshaling.","PeriodicalId":225539,"journal":{"name":"Proceedings of the 25th European MPI Users' Group Meeting","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129458559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
Supporting MPI-distributed stream parallel patterns in GrPPI
Pub Date: 2018-09-23 | DOI: 10.1145/3236367.3236380
Javier Fernández Muñoz, M. F. Dolz, David del Rio Astorga, Javier Prieto Cepeda, José Daniel García Sánchez
{"title":"Supporting MPI-distributed stream parallel patterns in GrPPI","authors":"Javier Fernández Muñoz, M. F. Dolz, David del Rio Astorga, Javier Prieto Cepeda, José Daniel García Sánchez","doi":"10.1145/3236367.3236380","DOIUrl":"https://doi.org/10.1145/3236367.3236380","url":null,"abstract":"In the recent years, the large volumes of stream data and the near real-time requirements of data streaming applications have exacerbated the need for new scalable algorithms and programming interfaces for distributed and shared-memory platforms. To contribute in this direction, this paper presents a new distributed MPI back end for GrPPI, a C++ high-level generic interface of data-intensive and stream processing parallel patterns. This back end, as a new execution policy, supports the distributed and hybrid (distributed and shared-memory) parallel execution of the Pipeline and Farm patterns, where the hybrid mode combines the MPI policy with a GrPPI shared-memory one. A detailed analysis of the GrPPI MPI execution policy reports considerable benefits from the programmability, flexibility and readability points of view. The experimental evaluation on a streaming application with different distributed and shared-memory scenarios reports considerable performance gains with respect to the sequential versions at the expense of negligible GrPPI overheads.","PeriodicalId":225539,"journal":{"name":"Proceedings of the 25th European MPI Users' Group Meeting","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127587110","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Using Simulation to Examine the Effect of MPI Message Matching Costs on Application Performance
Pub Date: 2018-09-23 | DOI: 10.1145/3236367.3236375
Scott Levy, Kurt B. Ferreira
{"title":"Using Simulation to Examine the Effect of MPI Message Matching Costs on Application Performance","authors":"Scott Levy, Kurt B. Ferreira","doi":"10.1145/3236367.3236375","DOIUrl":"https://doi.org/10.1145/3236367.3236375","url":null,"abstract":"Attaining high performance with MPI applications requires efficient message matching to minimize message processing overheads and the latency these overheads introduce into application communication. In this paper, we use a validated simulation-based approach to examine the relationship between MPI message matching performance and application time-to-solution. Specifically, we examine how the performance of several important HPC workloads is affected by the time required for matching. Our analysis yields several important contributions: (i) the performance of current workloads is unlikely to be significantly affected by MPI matching unless match queue operations get much slower or match queues get much longer; (ii) match queue designs that provide sublinear performance as a function of queue length are unlikely to yield much benefit unless match queue lengths increase dramatically; and (iii) we provide guidance on how long the mean time per match attempt may be without significantly affecting application performance. The results and analysis in this paper provide valuable guidance on the design and development of MPI message match queues.","PeriodicalId":225539,"journal":{"name":"Proceedings of the 25th European MPI Users' Group Meeting","volume":"173 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131929682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Improving Performance Models for Irregular Point-to-Point Communication
Pub Date: 2018-06-06 | DOI: 10.1145/3236367.3236368
Amanda Bienz, W. Gropp, Luke N. Olson
{"title":"Improving Performance Models for Irregular Point-to-Point Communication","authors":"Amanda Bienz, W. Gropp, Luke N. Olson","doi":"10.1145/3236367.3236368","DOIUrl":"https://doi.org/10.1145/3236367.3236368","url":null,"abstract":"Parallel applications are often unable to take full advantage of emerging parallel architectures due to scaling limitations, which arise due to inter-process communication. Performance models are used to analyze the sources of communication costs. However, traditional models for point-to-point communication fail to capture the full cost of many irregular operations, such as sparse matrix methods. In this paper, a node-aware based model is presented. Furthermore, the model is extended to include communication queue search time as well as an additional parameter estimating network contention. The resulting model is applied to a variety of irregular communication patterns throughout matrix operations, displaying improved accuracy over traditional models.","PeriodicalId":225539,"journal":{"name":"Proceedings of the 25th European MPI Users' Group Meeting","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125049161","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?
Pub Date: 2017-07-28 | DOI: 10.1145/3236367.3236381
A. Awan, Ching-Hsiang Chu, H. Subramoni, D. Panda
{"title":"Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?","authors":"A. Awan, Ching-Hsiang Chu, H. Subramoni, D. Panda","doi":"10.1145/3236367.3236381","DOIUrl":"https://doi.org/10.1145/3236367.3236381","url":null,"abstract":"Traditionally, MPI runtimes have been designed for clusters with a large number of nodes. However, with the advent of MPI+CUDA applications and dense multi-GPU systems, it has become important to design efficient communication schemes. This coupled with new application workloads brought forward by Deep Learning frameworks like Caffe and Microsoft CNTK pose additional design constraints due to very large message communication of GPU buffers during the training phase. In this context, special-purpose libraries like NCCL have been proposed. In this paper, we propose a pipelined chain (ring) design for the MPI_Bcast collective operation along with an enhanced collective tuning framework in MVAPICH2-GDR that enables efficient intra-/internode multi-GPU communication. We present an in-depth performance landscape for the proposed MPI_Bcast schemes along with a comparative analysis of NCCL Broadcast and NCCL-based MPI_Bcast. The proposed designs for MVAPICH2-GDR enable up to 14X and 16.6X improvement, compared to NCCL-based solutions, for intra- and internode broadcast latency, respectively. In addition, the proposed designs provide up to 7% improvement over NCCL-based solutions for data parallel training of the VGG network on 128 GPUs using Microsoft CNTK. The proposed solutions outperform the recently introduced NCCL2 library for small and medium message sizes and offer comparable/better performance for very large message sizes.","PeriodicalId":225539,"journal":{"name":"Proceedings of the 25th European MPI Users' Group Meeting","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134077111","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 41