Proceedings of the 22nd European MPI Users' Group Meeting: Latest Publications

Correctness Analysis of MPI-3 Non-Blocking Communications in PARCOACH
Proceedings of the 22nd European MPI Users' Group Meeting Pub Date : 2015-09-21 DOI: 10.1145/2802658.2802674
Julien Jaeger, Emmanuelle Saillard, Patrick Carribault, Denis Barthou
{"title":"Correctness Analysis of MPI-3 Non-Blocking Communications in PARCOACH","authors":"Julien Jaeger, Emmanuelle Saillard, Patrick Carribault, Denis Barthou","doi":"10.1145/2802658.2802674","DOIUrl":"https://doi.org/10.1145/2802658.2802674","url":null,"abstract":"MPI-3 provide functions for non-blocking collectives. To help programmers introduce non-blocking collectives to existing MPI programs, we improve the PARCOACH tool for checking correctness of MPI call sequences. These enhancements focus on correct call sequences of all flavor of collective calls, and on the presence of completion calls for all non-blocking communications. The evaluation shows an overhead under 10% of original compilation time.","PeriodicalId":365272,"journal":{"name":"Proceedings of the 22nd European MPI Users' Group Meeting","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115209570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
STCI: Scalable RunTime Component Infrastructure
Proceedings of the 22nd European MPI Users' Group Meeting Pub Date : 2015-09-21 DOI: 10.1145/2802658.2802675
Geoffroy R. Vallée, D. Bernholdt, S. Böhm, T. Naughton
{"title":"STCI: Scalable RunTime Component Infrastructure","authors":"Geoffroy R. Vallée, D. Bernholdt, S. Böhm, T. Naughton","doi":"10.1145/2802658.2802675","DOIUrl":"https://doi.org/10.1145/2802658.2802675","url":null,"abstract":"Geoffroy Vallee Oak Ridge National Laboratory 1 Bethel Valley Road Oak Ridge, Tennessee, USA valleegr@ornl.gov David Bernholdt Oak Ridge National Laboratory 1 Bethel Valley Road Oak Ridge, Tennessee, USA bernholdtde@ornl.gov Swen Bohm Oak Ridge National Laboratory 1 Bethel Valley Road Oak Ridge, Tennessee, USA bohms@ornl.gov Thomas Naughton Oak Ridge National Laboratory 1 Bethel Valley Road Oak Ridge, Tennessee, USA naughtont@ornl.gov","PeriodicalId":365272,"journal":{"name":"Proceedings of the 22nd European MPI Users' Group Meeting","volume":"138 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123458201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Plan B: Interruption of Ongoing MPI Operations to Support Failure Recovery
Proceedings of the 22nd European MPI Users' Group Meeting Pub Date : 2015-09-21 DOI: 10.1145/2802658.2802668
Aurélien Bouteiller, G. Bosilca, J. Dongarra
{"title":"Plan B: Interruption of Ongoing MPI Operations to Support Failure Recovery","authors":"Aurélien Bouteiller, G. Bosilca, J. Dongarra","doi":"10.1145/2802658.2802668","DOIUrl":"https://doi.org/10.1145/2802658.2802668","url":null,"abstract":"Advanced failure recovery strategies in HPC system benefit tremendously from in-place failure recovery, in which the MPI infrastructure can survive process crashes and resume communication services. In this paper we present the rationale behind the specification, and an effective implementation of the Revoke MPI operation. The purpose of the Revoke operation is the propagation of failure knowledge, and the interruption of ongoing, pending communication, under the control of the user. We explain that the Revoke operation can be implemented with a reliable broadcast over the scalable and failure resilient Binomial Graph (BMG) overlay network. Evaluation at scale, on a Cray XC30 supercomputer, demonstrates that the Revoke operation has a small latency, and does not introduce system noise outside of failure recovery periods.","PeriodicalId":365272,"journal":{"name":"Proceedings of the 22nd European MPI Users' Group Meeting","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134115159","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
Specification Guideline Violations by MPI_Dims_create
Proceedings of the 22nd European MPI Users' Group Meeting Pub Date : 2015-09-21 DOI: 10.1145/2802658.2802677
J. Träff, F. Lübbe
{"title":"Specification Guideline Violations by MPI_Dims_create","authors":"J. Träff, F. Lübbe","doi":"10.1145/2802658.2802677","DOIUrl":"https://doi.org/10.1145/2802658.2802677","url":null,"abstract":"In benchmarking a library providing alternative functionality for structured, so-called isomorphic, sparse collective communication [4], we found use for the MPI_Dims_create functionality of MPI [3] for suggesting a balanced factorization of a given number p (of MPI processes) into d factors that can be used as the dimension sizes in a d-dimensional Cartesian communicator. Much to our surprise, we observed that a) different MPI libraries can differ quite significantly in the factorization they suggest, and b) the produced factorizations can sometimes be quite far from balanced, indeed, for some composite numbers p some MPI libraries sometimes return trivial factorizations (p as factor). This renders the functionality, as implemented, useless. In this poster abstract, we elaborate on these findings.","PeriodicalId":365272,"journal":{"name":"Proceedings of the 22nd European MPI Users' Group Meeting","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131569549","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
Isomorphic, Sparse MPI-like Collective Communication Operations for Parallel Stencil Computations
Proceedings of the 22nd European MPI Users' Group Meeting Pub Date : 2015-09-21 DOI: 10.1145/2802658.2802663
J. Träff, F. Lübbe, Antoine Rougier, S. Hunold
{"title":"Isomorphic, Sparse MPI-like Collective Communication Operations for Parallel Stencil Computations","authors":"J. Träff, F. Lübbe, Antoine Rougier, S. Hunold","doi":"10.1145/2802658.2802663","DOIUrl":"https://doi.org/10.1145/2802658.2802663","url":null,"abstract":"We propose a specification and discuss implementations of collective operations for parallel stencil-like computations that are not supported well by the current MPI 3.1 neighborhood collectives. In our isomorphic, sparse collectives all processes partaking in the communication operation use similar neighborhoods of processes with which to exchange data. Our interface assumes the p processes to be arranged in a d-dimensional torus (mesh) over which neighborhoods are specified per process by identical lists of relative coordinates. This extends significantly on the functionality for Cartesian communicators, and is a much lighter mechanism than distributed graph topologies. It allows for fast, local computation of communication schedules, and can be used in more dynamic contexts than current MPI functionality. We sketch three algorithms for neighborhoods with s source and target neighbors, namely a) a direct algorithm taking s communication rounds, b) a message-combining algorithm that communicates only along torus coordinates, and c) a message-combining algorithm using between [log s] and [log p] communication rounds. Our concrete interface has been implemented using the direct algorithm a). We benchmark our implementations and compare to the MPI neighborhood collectives. We demonstrate significant advantages in set-up times, and comparable communication times. Finally, we use our isomorphic, sparse collectives to implement a stencil computation with a deep halo, and discuss derived datatypes required for this application.","PeriodicalId":365272,"journal":{"name":"Proceedings of the 22nd European MPI Users' Group Meeting","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125400825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
Efficient, Optimal MPI Datatype Reconstruction for Vector and Index Types
Proceedings of the 22nd European MPI Users' Group Meeting Pub Date : 2015-09-21 DOI: 10.1145/2802658.2802671
Martin Kalany, J. Träff
{"title":"Efficient, Optimal MPI Datatype Reconstruction for Vector and Index Types","authors":"Martin Kalany, J. Träff","doi":"10.1145/2802658.2802671","DOIUrl":"https://doi.org/10.1145/2802658.2802671","url":null,"abstract":"Type reconstruction is the process of finding an efficient representation in terms of space and processing time of a data layout as an MPI derived datatype. Practically efficient type reconstruction and normalization is important for high-quality MPI implementations that strive to provide good performance for communication operations involving noncontiguous data. Although it has recently been shown that the general problem of computing optimal tree representations of derived datatypes allowing any of the MPI derived datatype constructors can be solved in polynomial time, the algorithm for this may unfortunately be impractical for datatypes with large counts. By restricting the allowed constructors to vector and index-block type constructors, but excluding the most general MPI_Type_create_struct constructor, the problem can be solved much more efficiently. More precisely, we give a new O(n log n/log log n) time algorithm for finding cost-optimal representations of MPI type maps of length n using only vector and index-block constructors for a simple but flexible, additive cost model. This improves significantly over a previous O(n√n) time algorithm for the same problem, and the algorithm is simple enough to be considered for practical MPI libraries.","PeriodicalId":365272,"journal":{"name":"Proceedings of the 22nd European MPI Users' Group Meeting","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125604744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
MPI Advisor: a Minimal Overhead Tool for MPI Library Performance Tuning
Proceedings of the 22nd European MPI Users' Group Meeting Pub Date : 2015-09-21 DOI: 10.1145/2802658.2802667
E. Gallardo, Jérôme Vienne, L. Fialho, P. Teller, J. Browne
{"title":"MPI Advisor: a Minimal Overhead Tool for MPI Library Performance Tuning","authors":"E. Gallardo, Jérôme Vienne, L. Fialho, P. Teller, J. Browne","doi":"10.1145/2802658.2802667","DOIUrl":"https://doi.org/10.1145/2802658.2802667","url":null,"abstract":"A majority of parallel applications executed on HPC clusters use MPI for communication between processes. Most users treat MPI as a black box, executing their programs using the cluster's default settings. While the default settings perform adequately for many cases, it is well known that optimizing the MPI environment can significantly improve application performance. Although the existing optimization tools are effective when used by performance experts, they require deep knowledge of MPI library behavior and the underlying hardware architecture in which the application will be executed. Therefore, an easy-to-use tool that provides recommendations for configuring the MPI environment to optimize application performance is highly desirable. This paper addresses this need by presenting an easy-to-use methodology and tool, named MPI Advisor, that requires just a single execution of the input application to characterize its predominant communication behavior and determine the MPI configuration that may enhance its performance on the target combination of MPI library and hardware architecture. Currently, MPI Advisor provides recommendations that address the four most commonly occurring MPI-related performance bottlenecks, which are related to the choice of: 1) point-to-point protocol (eager vs. rendezvous), 2) collective communication algorithm, 3) MPI tasks-to-cores mapping, and 4) Infiniband transport protocol. The performance gains obtained by implementing the recommended optimizations in the case studies presented in this paper range from a few percent to more than 40%. Specifically, using this tool, we were able to improve the performance of HPCG with MVAPICH2 on four nodes of the Stampede cluster from 6.9 GFLOP/s to 10.1 GFLOP/s. Since the tool provides application-specific recommendations, it also informs the user about correct usage of MPI.","PeriodicalId":365272,"journal":{"name":"Proceedings of the 22nd European MPI Users' Group Meeting","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126794703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 20
DAME: A Runtime-Compiled Engine for Derived Datatypes
Proceedings of the 22nd European MPI Users' Group Meeting Pub Date : 2015-09-21 DOI: 10.1145/2802658.2802659
Tarun Prabhu, W. Gropp
{"title":"DAME: A Runtime-Compiled Engine for Derived Datatypes","authors":"Tarun Prabhu, W. Gropp","doi":"10.1145/2802658.2802659","DOIUrl":"https://doi.org/10.1145/2802658.2802659","url":null,"abstract":"In order to achieve high performance on modern and future machines, applications need to make effective use of the complex, hierarchical memory system. Writing performance-portable code continues to be challenging since each architecture has unique memory access characteristics. In addition, some optimization decisions can only reasonably be made at runtime. This suggests that a two-pronged approach to address the challenge is required. First, provide the programmer with a means to express memory operations declaratively which will allow a runtime system to transparently access the memory in the best way and second, exploit runtime information. MPI's derived datatypes accomplish the former although their performance in current MPI implementations shows scope for improvement. JIT-compilation can be used for the latter. In this work, we present DAME --- a language and interpreter that is used as the backend for MPI's derived datatypes. We also present DAME-L and DAME-X, two JIT-enabled implementations of DAME. All three implementations have been integrated into MPICH. We evaluate the performance of our implementations using DDTBench and two mini-applications written with MPI derived datatypes and obtain communication speedups of up to 20x and mini-application speedup of 3x.","PeriodicalId":365272,"journal":{"name":"Proceedings of the 22nd European MPI Users' Group Meeting","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132139821","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
Performance Evaluation of OpenFOAM* with MPI-3 RMA Routines on Intel® Xeon® Processors and Intel® Xeon Phi™ Coprocessors
Proceedings of the 22nd European MPI Users' Group Meeting Pub Date : 2015-09-21 DOI: 10.1145/2802658.2802676
Nishant Agrawal, Paul Edwards, Ambuj Pandey, Michael Klemm, Ravi Ojha, R. A. Razak
{"title":"Performance Evaluation of OpenFOAM* with MPI-3 RMA Routines on Intel® Xeon® Processors and Intel® Xeon Phi™ Coprocessors","authors":"Nishant Agrawal, Paul Edwards, Ambuj Pandey, Michael Klemm, Ravi Ojha, R. A. Razak","doi":"10.1145/2802658.2802676","DOIUrl":"https://doi.org/10.1145/2802658.2802676","url":null,"abstract":"OpenFOAM is a software package for solving partial differential equations and is very popular for computational fluid dynamics in the automotive segment. In this work, we describe our evaluation of the performance of OpenFOAM with MPI-3 Remote Memory Access (RMA) one-sided communication on the Intel® Xeon Phi\" coprocessor. Currently, OpenFOAM computes on a mesh that is decomposed among different MPI ranks, and it requires a high amount of communication between the neighboring ranks. MPI-3 offers RMA through a new API that decouples communication and synchronization. The aim is to achieve better performance with MPI-3 RMA routines as compared to the current two-sided asynchronous communication routines in OpenFOAM. We also showcase the challenges overcome in order to facilitate the different MPI-3 RMA routines in OpenFOAM. This discussion aims at analyzing the potential of MPI-3 RMA in OpenFOAM and benchmarking the performance on both the processor and the coprocessor. Our work also demonstrates that MPI-3 RMA in OpenFOAM can run in symmetric mode consisting of the Intel® Xeon® E5-2697v3 processor and the Intel® Xeon Phi™ 7120P coprocessor.","PeriodicalId":365272,"journal":{"name":"Proceedings of the 22nd European MPI Users' Group Meeting","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133461024","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
On the Impact of Synchronizing Clocks and Processes on Benchmarking MPI Collectives
Proceedings of the 22nd European MPI Users' Group Meeting Pub Date : 2015-09-21 DOI: 10.1145/2802658.2802662
S. Hunold, Alexandra Carpen-Amarie
{"title":"On the Impact of Synchronizing Clocks and Processes on Benchmarking MPI Collectives","authors":"S. Hunold, Alexandra Carpen-Amarie","doi":"10.1145/2802658.2802662","DOIUrl":"https://doi.org/10.1145/2802658.2802662","url":null,"abstract":"We consider the problem of accurately measuring the time to complete an MPI collective operation, as the result strongly depends on how the time is measured. Our goal is to develop an experimental method that allows for reproducible measurements of MPI collectives. When executing large parallel codes, MPI processes are often skewed in time when entering a collective operation. However, to obtain reproducible measurements, it is a common approach to synchronize all processes before they call the MPI collective operation. We therefore take a closer look at two commonly used process synchronization schemes: (1) relying on MPI_Barrier or (2) applying a window-based scheme using a common global time. We analyze both schemes experimentally and show the strengths and weaknesses of each approach. As window-based schemes require the notion of global time, we thoroughly evaluate different clock synchronization algorithms in various experiments. We also propose a novel clock synchronization algorithm that combines two advantages of known algorithms, which are (1) taking the inherent clock drift into account and (2) using a tree-based synchronization scheme to reduce the synchronization duration.","PeriodicalId":365272,"journal":{"name":"Proceedings of the 22nd European MPI Users' Group Meeting","volume":"07 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131216660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12