Proceedings of the 25th European MPI Users' Group Meeting: Latest Publications

Efficient Asynchronous Communication Progress for MPI without Dedicated Resources
Pub Date: 2018-09-23 | DOI: 10.1145/3236367.3236376
Amit Ruhela, H. Subramoni, S. Chakraborty, Mohammadreza Bayatpour, Pouya Kousha, D. Panda
{"title":"Efficient Asynchronous Communication Progress for MPI without Dedicated Resources","authors":"Amit Ruhela, H. Subramoni, S. Chakraborty, Mohammadreza Bayatpour, Pouya Kousha, D. Panda","doi":"10.1145/3236367.3236376","DOIUrl":"https://doi.org/10.1145/3236367.3236376","url":null,"abstract":"The overlap of computation and communication is critical for good performance of many HPC applications. State-of-the-art designs for the asynchronous progress require specially designed hardware resources (advanced switches or network interface cards), dedicated processor cores or application modification (e.g. use of MPI_Test). These techniques suffer from various issues like increasing code complexity/cost and loss of available compute resources for end applications. In this paper, we take up this challenge and propose a simple yet effective technique to achieve good overlap without needing any additional hardware or software resources. The proposed thread-based design allows MPI libraries to self-detect when asynchronous communication progress is needed and minimizes the number of context-switches and preemption between the main thread and the asynchronous progress thread. We evaluate the proposed design against state-of-the-art designs in other MPI libraries including MVAPICH2, Intel MPI, and Open MPI. We demonstrate benefits of the proposed approach at microbenchmark and at application level at scale on four different architectures including Intel Broadwell, Intel Xeon Phi (KNL), IBM OpenPOWER, and Intel Skylake with InfiniBand and Omni-Path interconnects. Compared to other state-of-the-art designs, our proposed approach shows upto 46%, 37%, and 49% improvement for All-to-one, One-to-all, and All-to-all communication patterns respectively collectives on 1,024 processes. We also show 38% performance improvement for SPEC MPI compute-intensive applications on 384 processes and 44% performance improvement with the P3DFFT application on 448 processes.","PeriodicalId":225539,"journal":{"name":"Proceedings of the 25th European MPI Users' Group Meeting","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126298228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 15
Cellular automata beyond 100k cores: MPI vs Fortran coarrays
Pub Date: 2018-09-23 | DOI: 10.1145/3236367.3236384
A. Shterenlikht, L. Cebamanos
{"title":"Cellular automata beyond 100k cores: MPI vs Fortran coarrays","authors":"A. Shterenlikht, L. Cebamanos","doi":"10.1145/3236367.3236384","DOIUrl":"https://doi.org/10.1145/3236367.3236384","url":null,"abstract":"Fortran coarrays are an attractive alternative to MPI due to a familiar Fortran syntax, single sided communications and implementation in the compiler. Scaling of coarrays is compared in this work to MPI, using cellular automata (CA) 3D Ising magnetisation miniapps, built with the CASUP CA library, https://cgpack.sourceforge.io, developed by the authors. Ising energy and magnetisation were calculated with MPI_ALLREDUCE and Fortran 2018 co_sum collectives. The work was done on ARCHER (Cray XC30) up to the full machine capacity: 109,056 cores. Ping-pong latency and bandwidth results are very similar with MPI and with coarrays for message sizes from 1B to several MB. MPI halo exchange (HX) scaled better than coarray HX, which is surprising because both algorithms use pair-wise communications: MPI IRECV/ISEND/WAITALL vs Fortran sync images. Adding OpenMP to MPI or to coarrays resulted in worse L2 cache hit ratio, and lower performance in all cases, even though the NUMA effects were ruled out. This is likely because the CA algorithm is memory and network bound. The sampling and tracing analysis shows good load balancing in compute in all miniapps, but imbalance in communication, indicating that the difference in performance between MPI and coarrays is likely due to parallel libraries (MPICH2 vs libpgas) and the Cray hardware specific libraries (uGNI vs DMAPP). Overall, the results look promising for coarray use beyond 100k cores. However, further coarray optimisation is needed to narrow the performance gap between coarrays and MPI.","PeriodicalId":225539,"journal":{"name":"Proceedings of the 25th European MPI Users' Group Meeting","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127886268","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Performance model for mesh optimization on distributed-memory computers
Pub Date: 2018-09-23 | DOI: 10.1145/3236367.3236372
Domingo Benítez, J. M. Escobar, R. Montenegro, E. Rodríguez
{"title":"Performance model for mesh optimization on distributed-memory computers","authors":"Domingo Benítez, J. M. Escobar, R. Montenegro, E. Rodríguez","doi":"10.1145/3236367.3236372","DOIUrl":"https://doi.org/10.1145/3236367.3236372","url":null,"abstract":"Many mesh optimization applications are based on vertex repositioning algorithms (VrPA). The execution times of these numerical algorithms vary widely, usually with a trade-off between different parameters. In this work, we analyze the impacts of six parameters of sequential VrPA on runtime. Our analysis is used to propose a new workload measure called number of mesh element evaluations. Since the execution time required for VrPA programs may be too large and there is concurrency in processing mesh elements, parallelism has been used to improve performance efficiently. The performance model is extended to parallel VrPA algorithms that are implemented in MPI. This model has been validated using two Open MPI versions on two distributed-memory computers and is the basis for the quantitative analysis of performance scalability, load balancing and synchronization and communication overheads. Finally, a new approach to mesh partitioning that improves load balancing is proposed.","PeriodicalId":225539,"journal":{"name":"Proceedings of the 25th European MPI Users' Group Meeting","volume":"123 6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133356307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Improving the Interoperability between MPI and Task-Based Programming Models
Pub Date: 2018-09-23 | DOI: 10.1145/3236367.3236382
Kevin Sala, Jorge Bellón, Pau Farré, Xavier Teruel, Josep M. Pérez, Antonio J. Peña, Daniel J. Holmes, Vicencc Beltran, Jesús Labarta
{"title":"Improving the Interoperability between MPI and Task-Based Programming Models","authors":"Kevin Sala, Jorge Bellón, Pau Farré, Xavier Teruel, Josep M. Pérez, Antonio J. Peña, Daniel J. Holmes, Vicencc Beltran, Jesús Labarta","doi":"10.1145/3236367.3236382","DOIUrl":"https://doi.org/10.1145/3236367.3236382","url":null,"abstract":"In this paper we propose an API to pause and resume task execution depending on external events. We leverage this generic API to improve the interoperability between MPI synchronous communication primitives and tasks. When an MPI operation blocks, the task running is paused so that the runtime system can schedule a new task on the core that became idle. Once the MPI operation is completed, the paused task is put again on the runtime system's ready queue. We expose our proposal through a new MPI threading level which we implement through two approaches. The first approach is an MPI wrapper library that works with any MPI implementation by intercepting MPI synchronous calls, implementing them on top of their asynchronous counterparts. In this case, the task-based runtime system is also extended to periodically check for pending MPI operations and resume the corresponding tasks once MPI operations complete. The second approach consists in directly modifying the MPICH runtime system, a well-known implementation of MPI, to directly call the pause/resume API when a synchronous MPI operation blocks and completes, respectively. Our experiments reveal that this proposal not only simplifies the development of hybrid MPI+OpenMP applications that naturally overlap computation and communication phases; it also improves application performance and scalability by removing artificial dependencies across communication tasks.","PeriodicalId":225539,"journal":{"name":"Proceedings of the 25th European MPI Users' Group Meeting","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131148675","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 19
MPI Derived Datatypes: Performance and Portability Issues
Pub Date: 2018-09-23 | DOI: 10.1145/3236367.3236378
Qingqing Xiong, P. Bangalore, A. Skjellum, M. Herbordt
{"title":"MPI Derived Datatypes: Performance and Portability Issues","authors":"Qingqing Xiong, P. Bangalore, A. Skjellum, M. Herbordt","doi":"10.1145/3236367.3236378","DOIUrl":"https://doi.org/10.1145/3236367.3236378","url":null,"abstract":"This paper addresses performance-portability and overall performance issues when derived datatypes are used with four MPI implementations: Open MPI, MPICH, MVAPICH2, and Intel MPI. These comparisons are particularly relevant today since most vendor implementations are now based on Open MPI or MPICH rather than on vendor proprietary code as was more prevalent in the past. Our findings are that, within a single MPI implementation, there are significant differences in performance as a function of it reasonable encodings of derived datatypes as supported by the MPI standard. While this finding may not be surprising, it is important to understand how fundamental vs. arbitrary choices made in early implementation impact the use of derived datatypes to date. A more significant finding is that one cannot reliably choose a single derived datatype format and expect uniform performance portability among these four implementations. That is, the best-performing path under one of the MPI code bases is not the same as the best-performing path under another. Users have to be prepared to recode for a different formulation to move efficiently among MPICH, MVAPICH2, Intel MPI, and Open MPI. This lack of uniformity presents a significant gap in MPI's fundamental purpose of offering performance portability. Specific examination of internal implementation details indicates why performance is different among the implementations. Proposed solutions to this problem include i) revamping datatypes; ii) providing a common, underlying datatype standard used by multiple MPI implementations; and iii) exploring new ways to describe derived datatypes that are optimizable by modern networks and faster than MPI implementations' software-based marshaling and unmarshaling.","PeriodicalId":225539,"journal":{"name":"Proceedings of the 25th European MPI Users' Group Meeting","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129458559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
Supporting MPI-distributed stream parallel patterns in GrPPI
Pub Date: 2018-09-23 | DOI: 10.1145/3236367.3236380
Javier Fernández Muñoz, M. F. Dolz, David del Rio Astorga, Javier Prieto Cepeda, José Daniel García Sánchez
{"title":"Supporting MPI-distributed stream parallel patterns in GrPPI","authors":"Javier Fernández Muñoz, M. F. Dolz, David del Rio Astorga, Javier Prieto Cepeda, José Daniel García Sánchez","doi":"10.1145/3236367.3236380","DOIUrl":"https://doi.org/10.1145/3236367.3236380","url":null,"abstract":"In the recent years, the large volumes of stream data and the near real-time requirements of data streaming applications have exacerbated the need for new scalable algorithms and programming interfaces for distributed and shared-memory platforms. To contribute in this direction, this paper presents a new distributed MPI back end for GrPPI, a C++ high-level generic interface of data-intensive and stream processing parallel patterns. This back end, as a new execution policy, supports the distributed and hybrid (distributed and shared-memory) parallel execution of the Pipeline and Farm patterns, where the hybrid mode combines the MPI policy with a GrPPI shared-memory one. A detailed analysis of the GrPPI MPI execution policy reports considerable benefits from the programmability, flexibility and readability points of view. The experimental evaluation on a streaming application with different distributed and shared-memory scenarios reports considerable performance gains with respect to the sequential versions at the expense of negligible GrPPI overheads.","PeriodicalId":225539,"journal":{"name":"Proceedings of the 25th European MPI Users' Group Meeting","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127587110","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Using Simulation to Examine the Effect of MPI Message Matching Costs on Application Performance
Pub Date: 2018-09-23 | DOI: 10.1145/3236367.3236375
Scott Levy, Kurt B. Ferreira
{"title":"Using Simulation to Examine the Effect of MPI Message Matching Costs on Application Performance","authors":"Scott Levy, Kurt B. Ferreira","doi":"10.1145/3236367.3236375","DOIUrl":"https://doi.org/10.1145/3236367.3236375","url":null,"abstract":"Attaining high performance with MPI applications requires efficient message matching to minimize message processing overheads and the latency these overheads introduce into application communication. In this paper, we use a validated simulation-based approach to examine the relationship between MPI message matching performance and application time-to-solution. Specifically, we examine how the performance of several important HPC workloads is affected by the time required for matching. Our analysis yields several important contributions: (i) the performance of current workloads is unlikely to be significantly affected by MPI matching unless match queue operations get much slower or match queues get much longer; (ii) match queue designs that provide sublinear performance as a function of queue length are unlikely to yield much benefit unless match queue lengths increase dramatically; and (iii) we provide guidance on how long the mean time per match attempt may be without significantly affecting application performance. The results and analysis in this paper provide valuable guidance on the design and development of MPI message match queues.","PeriodicalId":225539,"journal":{"name":"Proceedings of the 25th European MPI Users' Group Meeting","volume":"173 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131929682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Improving Performance Models for Irregular Point-to-Point Communication
Pub Date: 2018-06-06 | DOI: 10.1145/3236367.3236368
Amanda Bienz, W. Gropp, Luke N. Olson
{"title":"Improving Performance Models for Irregular Point-to-Point Communication","authors":"Amanda Bienz, W. Gropp, Luke N. Olson","doi":"10.1145/3236367.3236368","DOIUrl":"https://doi.org/10.1145/3236367.3236368","url":null,"abstract":"Parallel applications are often unable to take full advantage of emerging parallel architectures due to scaling limitations, which arise due to inter-process communication. Performance models are used to analyze the sources of communication costs. However, traditional models for point-to-point communication fail to capture the full cost of many irregular operations, such as sparse matrix methods. In this paper, a node-aware based model is presented. Furthermore, the model is extended to include communication queue search time as well as an additional parameter estimating network contention. The resulting model is applied to a variety of irregular communication patterns throughout matrix operations, displaying improved accuracy over traditional models.","PeriodicalId":225539,"journal":{"name":"Proceedings of the 25th European MPI Users' Group Meeting","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125049161","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?
Pub Date: 2017-07-28 | DOI: 10.1145/3236367.3236381
A. Awan, Ching-Hsiang Chu, H. Subramoni, D. Panda
{"title":"Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?","authors":"A. Awan, Ching-Hsiang Chu, H. Subramoni, D. Panda","doi":"10.1145/3236367.3236381","DOIUrl":"https://doi.org/10.1145/3236367.3236381","url":null,"abstract":"Traditionally, MPI runtimes have been designed for clusters with a large number of nodes. However, with the advent of MPI+CUDA applications and dense multi-GPU systems, it has become important to design efficient communication schemes. This coupled with new application workloads brought forward by Deep Learning frameworks like Caffe and Microsoft CNTK pose additional design constraints due to very large message communication of GPU buffers during the training phase. In this context, special-purpose libraries like NCCL have been proposed. In this paper, we propose a pipelined chain (ring) design for the MPI_Bcast collective operation along with an enhanced collective tuning framework in MVAPICH2-GDR that enables efficient intra-/internode multi-GPU communication. We present an in-depth performance landscape for the proposed MPI_Bcast schemes along with a comparative analysis of NCCL Broadcast and NCCL-based MPI_Bcast. The proposed designs for MVAPICH2-GDR enable up to 14X and 16.6X improvement, compared to NCCL-based solutions, for intra- and internode broadcast latency, respectively. In addition, the proposed designs provide up to 7% improvement over NCCL-based solutions for data parallel training of the VGG network on 128 GPUs using Microsoft CNTK. The proposed solutions outperform the recently introduced NCCL2 library for small and medium message sizes and offer comparable/better performance for very large message sizes.","PeriodicalId":225539,"journal":{"name":"Proceedings of the 25th European MPI Users' Group Meeting","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134077111","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 41