Proceedings of the 27th European MPI Users' Group Meeting: Latest Papers

Fibers are not (P)Threads: The Case for Loose Coupling of Asynchronous Programming Models and MPI Through Continuations

Proceedings of the 27th European MPI Users' Group Meeting Pub Date : 2020-09-21 DOI: 10.1145/3416315.3416320

Joseph Schuchart, Christoph Niethammer, J. Gracia

Abstract: Asynchronous programming models (APM) are gaining more and more traction, allowing applications to expose the available concurrency to a runtime system tasked with coordinating the execution. While MPI has long provided support for multi-threaded communication and non-blocking operations, it falls short of adequately supporting APMs as correctly and efficiently handling MPI communication in different models is still a challenge. Meanwhile, new low-level implementations of light-weight, cooperatively scheduled execution contexts (fibers, aka user-level threads (ULT)) are meant to serve as a basis for higher-level APMs and their integration in MPI implementations has been proposed as a replacement for traditional POSIX thread support to alleviate these challenges. In this paper, we first establish a taxonomy in an attempt to clearly distinguish different concepts in the parallel software stack. We argue that the proposed tight integration of fiber implementations with MPI is neither warranted nor beneficial and instead is detrimental to the goal of MPI being a portable communication abstraction. We propose MPI Continuations as an extension to the MPI standard to provide callback-based notifications on completed operations, leading to a clear separation of concerns by providing a loose coupling mechanism between MPI and APMs. We show that this interface is flexible and interacts well with different APMs, namely OpenMP detached tasks, OmpSs-2, and Argobots.
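The callback-based completion notification described in the abstract above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyRequest`, `ContinuationRegistry`, and `run_demo` are hypothetical names invented for illustration, no real MPI calls are made, and nothing here reflects the interface actually proposed in the paper. The point it shows is the loose coupling: the communication side only invokes a callback, and the task runtime decides what that callback does.

```python
from typing import Callable, List, Tuple


class ToyRequest:
    """Hypothetical stand-in for a non-blocking operation handle."""

    def __init__(self, polls_until_done: int) -> None:
        self._remaining = polls_until_done

    def test(self) -> bool:
        """Return True once the simulated operation has completed."""
        if self._remaining > 0:
            self._remaining -= 1
        return self._remaining == 0


class ContinuationRegistry:
    """Hypothetical registry pairing pending requests with callbacks."""

    def __init__(self) -> None:
        self._pending: List[Tuple[ToyRequest, Callable[[], None]]] = []

    def attach(self, request: ToyRequest, callback: Callable[[], None]) -> None:
        """Register a callback to run when the request completes."""
        self._pending.append((request, callback))

    def progress(self) -> int:
        """Poll each pending request once and fire callbacks of completed ones."""
        still_pending = []
        fired = 0
        for request, callback in self._pending:
            if request.test():
                callback()  # e.g. release a task in the APM runtime
                fired += 1
            else:
                still_pending.append((request, callback))
        self._pending = still_pending
        return fired

    def idle(self) -> bool:
        return not self._pending


def run_demo() -> List[str]:
    """Attach two continuations and drive progress until both have fired."""
    released: List[str] = []
    registry = ContinuationRegistry()
    registry.attach(ToyRequest(3), lambda: released.append("recv-task"))
    registry.attach(ToyRequest(1), lambda: released.append("send-task"))
    while not registry.idle():
        registry.progress()
    return released
```

Calling `run_demo()` returns the task names in the order their simulated operations complete, independent of the order in which the continuations were attached.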

Citations: 3

MPI Detach - Asynchronous Local Completion

Proceedings of the 27th European MPI Users' Group Meeting Pub Date : 2020-09-21 DOI: 10.1145/3416315.3416323

Joachim Protze, Marc-André Hermanns, A. C. Demiralp, Matthias S. Müller, T. Kuhlen

Citations: 7

Communication and Timing Issues with MPI Virtualization

Proceedings of the 27th European MPI Users' Group Meeting Pub Date : 2020-09-21 DOI: 10.1145/3416315.3416317

Alexandr Nigay, L. Mosimann, Timo Schneider, T. Hoefler

Abstract: Computation–communication overlap and good load balance are features central to high performance of parallel programs. Unfortunately, achieving them with MPI requires considerably increasing the complexity of user code. Our work contributes to the alternative solution to this problem: using a virtualized MPI implementation. Virtualized MPI implementations diverge from traditional MPI implementations in that they map MPI processes to user-level threads instead of operating-system processes and launch more of them than there are CPU cores in the system. They are capable of providing automatic computation–communication overlap and load balance with little to no changes to pre-existing MPI user code. Our work has uncovered new insights into MPI virtualization. Two new kinds of timers are needed, an MPI-process timer and a CPU-core timer; the same discussion also applies to performance counters and the MPI profiling interface. We also observe an interplay between the degree of CPU oversubscription and the rendezvous communication protocol: we find that the intuitive expectation of only two MPI processes per CPU core being enough to achieve full computation–communication overlap is wrong for the rendezvous protocol; instead, three MPI processes per CPU core are required in that case. Our findings are expected to be applicable to all virtualized MPI implementations as well as to general tasking runtime systems.
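The distinction between an MPI-process timer and a CPU-core timer mentioned in the abstract above can be made concrete with a deterministic toy simulation. This is a sketch under stated assumptions: the round-robin scheduler, the function `simulate_core`, and the rank names are hypothetical, and durations are plain numbers rather than measured time. It shows that when several virtual processes share a core, the time elapsed on the core differs from the time any one process actually ran.

```python
from collections import deque
from typing import Dict, List, Tuple


def simulate_core(work: Dict[str, List[float]]) -> Tuple[float, Dict[str, float]]:
    """Round-robin the given virtual processes on one simulated core.

    `work` maps a (hypothetical) virtual-process name to the durations of
    its compute slices; a process yields the core after every slice.
    Returns (core_clock, per_process_clock).
    """
    core_clock = 0.0  # CPU-core timer: advances for every slice run on the core
    process_clock = {name: 0.0 for name in work}  # one MPI-process timer each
    queue = deque((name, deque(slices)) for name, slices in work.items())
    while queue:
        name, slices = queue.popleft()
        if not slices:
            continue  # process had no work at all
        duration = slices.popleft()
        core_clock += duration
        process_clock[name] += duration  # only the running process is charged
        if slices:
            queue.append((name, slices))  # yield and wait for the next turn
    return core_clock, process_clock
```

For example, `simulate_core({"rank0": [1.0, 2.0], "rank1": [0.5]})` charges 3.0 time units to `rank0` and 0.5 to `rank1`, while the core clock reads 3.5. A single wall-clock timer read by `rank0` around its work would report the larger value.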

Citations: 1

Using Advanced Vector Extensions AVX-512 for MPI Reductions

Proceedings of the 27th European MPI Users' Group Meeting Pub Date : 2020-09-21 DOI: 10.1145/3416315.3416316

Dong Zhong, Qinglei Cao, G. Bosilca, J. Dongarra

Abstract: As the scale of high-performance computing (HPC) systems continues to grow, researchers are devoting themselves to exploring increasing levels of parallelism to achieve optimal performance. The modern CPU's design, including its features of hierarchical memory and SIMD/vectorization capability, governs algorithms' efficiency. The recent introduction of wide vector instruction set extensions (AVX and SVE) has made vectorization critically important for increasing efficiency and closing the gap to peak performance. In this paper, we propose an implementation of predefined MPI reduction operations utilizing AVX, AVX2 and AVX-512 intrinsics to provide vector-based reduction operations and to improve the time-to-solution of these predefined MPI reduction operations. With these optimizations, we achieve higher efficiency for local computations, which directly benefits the overall cost of collective reductions. The evaluation of the resulting software stack under different scenarios demonstrates that the solution is at the same time generic and efficient. Experiments conducted on an Intel Xeon Gold cluster show that our AVX-512-optimized reduction operations achieve a 10X performance benefit over the Open MPI default for MPI local reduction.
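The loop structure of a vector-based local reduction, as outlined in the abstract above, can be sketched as follows. This is an illustration only, under stated assumptions: Python has no SIMD intrinsics, so each slice operation merely stands in for one vector instruction; `LANES = 16` assumes 32-bit elements in a 512-bit register; and `reduce_sum_chunked` is a hypothetical name, not a function from Open MPI or from the paper. What it shows is the usual shape of such a kernel: a main loop over full vectors followed by a scalar loop for the remainder.

```python
from typing import List

LANES = 16  # assumption: a 512-bit register holding 32-bit elements


def reduce_sum_chunked(inbuf: List[int], inoutbuf: List[int]) -> None:
    """Compute inoutbuf[i] += inbuf[i], processing LANES elements per step."""
    n = len(inbuf)
    if len(inoutbuf) != n:
        raise ValueError("buffers must have the same length")
    full = n - n % LANES  # number of elements covered by whole vectors
    # Main loop: each iteration models one vector load, add, and store.
    for start in range(0, full, LANES):
        stop = start + LANES
        inoutbuf[start:stop] = [
            a + b for a, b in zip(inbuf[start:stop], inoutbuf[start:stop])
        ]
    # Remainder loop: leftover elements are handled one at a time.
    for i in range(full, n):
        inoutbuf[i] += inbuf[i]
```

With 40 elements, the main loop covers the first 32 in two steps and the remainder loop handles the last 8.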

Citations: 7

Signature Datatypes for Type Correct Collective Operations, Revisited

Proceedings of the 27th European MPI Users' Group Meeting Pub Date : 2020-09-21 DOI: 10.1145/3416315.3416324

J. Träff

Abstract: In order to provide for type correct implementations of applications in MPI that use derived datatypes to describe complex and possibly heterogeneous data layouts, signature datatypes describing the sequence of basic datatypes comprising the complex data layout in a compact manner have often been proposed and used to communicate and store such data in a type correct way. Signature datatypes are particularly useful in implementations of algorithms for collective communication employing pipelining and/or message-combining. We (re)examine the properties that signature datatypes must fulfill, and the properties of the MPI collective interfaces that guarantee the existence of proper signature datatypes. The analysis reveals that MPI_Alltoallw does not have the property, and thus that certain non-trivial, type correct implementations of this operation are not easily possible within MPI itself. We observe that the signature datatype for any derived datatype can be computed in O(n) operations in the number of elements n described by the derived datatype. While this improves on certain earlier approaches, this is still not a satisfactory solution for the cases where large layouts are described by small, derived datatypes. We explain how signature type computation is implemented in a library for advanced datatype programming.
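The idea of a type signature as a compact description of the sequence of basic datatypes can be illustrated with a toy model. This is a sketch under stated assumptions: the nested-tuple encoding of derived datatypes (`"contig"` and `"struct"` nodes) and the function names are hypothetical and do not correspond to the MPI datatype API or to the library discussed in the paper. The sketch walks every basic element once, so its work grows with the number of elements n, and it stores the result as run-length-encoded pairs.

```python
from typing import Iterator, List, Tuple, Union

# Hypothetical encoding: a basic type is a string such as "int";
# ("contig", count, subtype) repeats a subtype; ("struct", [(count, subtype), ...])
# concatenates blocks of possibly different subtypes.
Datatype = Union[str, tuple]


def basic_elements(dtype: Datatype) -> Iterator[str]:
    """Yield the basic types of a datatype in layout order, one per element."""
    if isinstance(dtype, str):
        yield dtype
    elif dtype[0] == "contig":
        _, count, sub = dtype
        for _ in range(count):
            yield from basic_elements(sub)
    elif dtype[0] == "struct":
        for count, sub in dtype[1]:
            for _ in range(count):
                yield from basic_elements(sub)
    else:
        raise ValueError(f"unknown datatype constructor: {dtype[0]!r}")


def signature(dtype: Datatype) -> List[Tuple[str, int]]:
    """Return the type signature as run-length-encoded (basic type, count) pairs."""
    runs: List[Tuple[str, int]] = []
    for basic in basic_elements(dtype):
        if runs and runs[-1][0] == basic:
            runs[-1] = (basic, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((basic, 1))  # start a new run
    return runs
```

Two layouts that differ in structure but describe the same sequence of basic types, such as `("contig", 3, "int")` and `("struct", [(2, "int"), (1, "int")])`, produce the same signature, which is the property a type correct communication relies on. Because this sketch visits every element, it also shows the limitation noted in the abstract: a small description of a large layout still costs time proportional to the layout.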

Citations: 1

Evaluating MPI Message Size Summary Statistics

Proceedings of the 27th European MPI Users' Group Meeting Pub Date : 2020-09-21 DOI: 10.1145/3416315.3416322

Kurt B. Ferreira, Scott Levy

Abstract: The Message Passing Interface (MPI) remains the dominant programming model for scientific applications running on today's high-performance computing (HPC) systems. This dominance stems from MPI's powerful semantics for inter-process communication, which have enabled scientists to write applications for simulating important physical phenomena. MPI does not, however, specify how messages and synchronization should be carried out. Those details are typically dependent on low-level architecture details and the message characteristics of the application. Therefore, analyzing an application's MPI usage is critical to tuning MPI's performance on a particular platform. The result of this analysis is typically a discussion of average message sizes for a workload or set of workloads. While a discussion of the message average might be the most intuitive summary statistic, it might not be the most useful in terms of representing the entire message size dataset for an application. Using a previously developed MPI trace collector, we analyze the MPI message traces for a number of key MPI workloads. Through this analysis, we demonstrate that the average, while easy and efficient to calculate, may not be a good representation of all subsets of application message sizes, with the median and mode of message sizes being a superior choice in most cases. We show that the problems with using the average relate to the multi-modal nature of the distribution of point-to-point messages. Finally, we show that while scaling a workload has little discernible impact on which measures of central tendency are representative of the underlying data, different input descriptions can significantly impact which metric is most effective. The results and analysis in this paper have the potential to provide valuable guidance on how we as a community should discuss and analyze MPI message data for scientific applications.
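The abstract's point about averages and multi-modal distributions can be seen with a short worked example. The message sizes below are made up for illustration and are not data from the paper; `summarize` is a hypothetical helper built on Python's standard `statistics` module. The sketch shows that for a bimodal set of sizes, the mean lands on a value that no message actually has, while the median and mode describe the typical message.

```python
import statistics
from typing import Dict, List


def summarize(sizes: List[int]) -> Dict[str, float]:
    """Return the mean, median, and mode of a list of message sizes in bytes."""
    return {
        "mean": statistics.mean(sizes),
        "median": statistics.median(sizes),
        "mode": statistics.mode(sizes),
    }


# Illustrative bimodal workload: many tiny messages plus a few 1 MiB transfers.
example_sizes = [8] * 90 + [1048576] * 10
example_summary = summarize(example_sizes)
```

Here the median and mode are both 8 bytes, matching 90% of the messages, while the mean is roughly 105 kB, a size that appears nowhere in the data and would suggest a very different tuning target.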

Citations: 3

Collectives and Communicators: A Case for Orthogonality: (Or: How to get rid of MPI neighbor and enhance Cartesian collectives)

Proceedings of the 27th European MPI Users' Group Meeting Pub Date : 2020-09-21 DOI: 10.1145/3416315.3416319

J. Träff, S. Hunold, Guillaume Mercier, Daniel J. Holmes

Abstract: A major reason for the success of MPI as the standard for large-scale, distributed memory programming is the economy and orthogonality of key concepts. These very design principles suggest leaner and better support for stencil-like, sparse collective communication, while at the same time significantly reducing the number of concrete operation interfaces, extending the functionality that can be supported by high-quality MPI implementations, and provisioning for possible future, much more wide-ranging functionality. As a starting point for discussion, we suggest (re)defining communicators as the sole carriers of the topological structure over processes that determines the semantics of the collective operations, and limiting the functions that can associate topological information with communicators to the functions for distributed graph topology and inter-communicator creation. As a consequence, one set of interfaces for collective communication operations (in blocking, non-blocking, and persistent variants) will suffice, explicitly eliminating the MPI_Neighbor_ interfaces (in all variants) from the MPI standard. Topological structure will not be implied by Cartesian communicators, which in turn will have the sole function of naming processes in a (d-dimensional, Euclidean) geometric space. The geometric naming can be passed to the topology creating functions as part of the communicator, and be used for the process reordering and topological collective algorithm selection. Concretely, at the price of only 1 essential, additional function, our suggestion can remove 10(+1) function interfaces from MPI-3, and 15 (or more) from MPI-4, while providing vastly more optimization scope for the MPI library implementation.

Citations: 1

Why is MPI (perceived to be) so complex?: Part 1—Does strong progress simplify MPI?

Proceedings of the 27th European MPI Users' Group Meeting Pub Date : 2020-09-21 DOI: 10.1145/3416315.3416318

Daniel J. Holmes, A. Skjellum, Derek Schafer

Abstract: Strong progress is optional in MPI. MPI allows implementations where progress (for example, updating the message-transport state machines or interaction with network devices) is only made during certain MPI procedure calls. Generally speaking, strong progress implies the ability to achieve progress (to transport data through the network from senders to receivers and exchange protocol messages) without explicit calls from user processes to MPI procedures. For instance, data given to a send procedure that matches a pre-posted receive on the receiving process is moved from source to destination in due course regardless of how often (including zero times) the sender or receiver processes call MPI in the meantime. Further, nonblocking operations and persistent collective operations work 'in the background' of user processes once all processes in the communicator's group have performed the starting step for the operation. Overall, strong progress is meant to enhance the potential for overlap of communication and computation and improve predictability of procedure execution times by eliminating progress effort from user threads. This paper posits that strong progress is desirable as an MPI implementation property and examines whether strong progress: This paper explores such possibilities and sets forth principles that underpin MPI and interactions with normal and fault modes of operation. The key contribution of this paper is the conclusion that, whether measured by absolute performance, by performance portability, or by interface simplicity, strong progress in MPI is no worse than weak progress and, in most scenarios, has more potential to fulfil the aforementioned desirable attributes.

Citations: 1

Implementation and performance evaluation of MPI persistent collectives in MPC: a case study

Proceedings of the 27th European MPI Users' Group Meeting Pub Date : 2020-09-21 DOI: 10.1145/3416315.3416321

Stéphane Bouhrour, Julien Jaeger

Citations: 1