"Mixing ranks, tasks, progress and nonblocking collectives"
Jean-Baptiste Besnard, Julien Jaeger, A. Malony, S. Shende, Hugo Taboada, Marc Pérache, Patrick Carribault
DOI: 10.1145/3343211.3343221

Abstract: Since the beginning, MPI has defined the rank as an implicit attribute associated with the MPI process's environment. In particular, each MPI process generally runs inside a given UNIX process and is associated with a fixed identifier in its WORLD communicator. However, this state of affairs is about to change with the rise of new abstractions such as MPI Sessions. In this paper, we outline how this evolution could enable optimizations that were previously tied to specific MPI runtimes executing MPI processes in shared memory (e.g., thread-based MPI). By implementing runtime-level work-sharing through what we define as MPI tasks, which can make progress independently of their stream context, we show that there is potential for improved asynchronous progress. In the absence of a Sessions implementation, this assumption is validated in the context of a thread-based MPI where nonblocking collectives (NBCs) were implemented on top of Extended Generic Requests progressed by any rank on the node, thanks to an MPI extension enabling threads to dynamically share their MPI context.
"A performance analysis and optimization of PMIx-based HPC software stacks"
A. Y. Polyakov, Boris I. Karasev, Joshua Hursey, Joshua Ladd, Mikhail Brinskii, E. Shipunova
DOI: 10.1145/3343211.3343220

Abstract: Process management libraries and runtime environments serve an important role in the HPC application lifecycle. This work provides a roadmap for implementing high-performance PMIx-based software stacks and targets four performance-critical areas, presenting novel co-designed solutions that significantly improve application performance during initialization and wire-up at scale. First, new locking and thread-safety schemes for PMIx on-host communication are designed, demonstrating up to a 66x reduction in PMIx_Get latency. Second, optimizations of the protocols involved in the wire-up procedure are proposed: specific improvements to the UCX endpoint address representation, the layout of PMIx metadata, and the use of Little-Endian Base 128 (LEB128) encoding decreased the volume of inter-node data exchanged by up to 8.6x. Third, a modification of the Bruck concatenation algorithm is presented that scales better than the ring- and tree-based implementations currently used in resource managers for PMIx data exchange. Lastly, an out-of-band channel leveraging the high-performance fabric is evaluated, demonstrating orders-of-magnitude performance improvements compared to the existing implementation.
{"title":"Evaluating tradeoffs between MPI message matching offload hardware capacity and performance","authors":"Scott Levy, Kurt B. Ferreira","doi":"10.1145/3343211.3343223","DOIUrl":"https://doi.org/10.1145/3343211.3343223","url":null,"abstract":"Although its demise has been frequently predicted, the Message Passing Interface (MPI) remains the dominant programming model for scientific applications running on high-performance computing (HPC) systems. MPI specifies powerful semantics for interprocess communication that have enabled scientists to write applications for simulating important physical phenomena. However, these semantics have also presented several significant challenges. For example, the existence of wildcard values has made the efficient enforcement of MPI message matching semantics challenging. Significant research has been dedicated to accelerating MPI message matching. One common approach has been to offload matching to dedicated hardware. One of the challenges that hardware designers have faced is knowing how to size hardware structures to accommodate outstanding match requests. Applications that exceed the capacity of specialized hardware typically must fall back to storing match requests in bulk memory, e.g. DRAM on the host processor. In this paper, we examine the implications of hardware matching and develop guidance on sizing hardware matching structure to strike a balance between minimizing expensive dedicated hardware resources and overall matching performance. By examining the message matching behavior of several important HPC workloads, we show that when specialized hardware matching is not dramatically faster than matching in memory the offload hardware's match queue capacity can be reduced without significantly increasing match time. On the other hand, effectively exploiting the benefits of very fast specialized matching hardware requires sufficient storage resources to ensure that every search completes in the specialized hardware. The data and analysis in this paper provide important guidance for designers of MPI message matching hardware.","PeriodicalId":314904,"journal":{"name":"Proceedings of the 26th European MPI Users' Group Meeting","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115060030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Minimizing the usage of hardware counters for collective communication using triggered operations"
Nusrat S. Islam, G. Zheng, S. Sur, Akhil Langer, M. Garzarán
DOI: 10.1145/3343211.3343222

Abstract: Triggered operations and counting events (counters) are building blocks that communication libraries such as MPI can use to offload collective operations to the Host Fabric Interface (HFI) or Network Interface Card (NIC). A triggered operation schedules a network or arithmetic operation to occur in the future, when a trigger counter reaches a specified threshold; on completion of the operation, a completion counter is incremented by one. With this mechanism, it is possible to create a chain of dependent operations, so that an operation executes only once all the operations it depends on have completed. Triggered operations rely on hardware counters on the HFI, which are a limited resource. Thus, if the number of required counters exceeds the number of hardware counters, a collective must stall until a previous collective completes and counters are released. In addition, if the HFI has a counter cache, utilizing a large number of counters can cause cache thrashing and poor performance. It is therefore important to reduce the number of counters, especially when running on a large supercomputer or when an application uses nonblocking collectives and multiple collectives can run concurrently. In this paper, we propose an algorithm to optimize the number of hardware counters used when offloading collectives with triggered operations. With our algorithm, different operations can share and reuse trigger and completion counters based on the dependences among them and their topological orderings. Our experimental results show that the proposed algorithm significantly reduces the number of counters compared to a default approach that does not consider the dependences among operations.
"Runtime level failure detection and propagation in HPC systems"
Dong Zhong, Aurélien Bouteiller, Xi Luo, G. Bosilca
DOI: 10.1145/3343211.3343225

Abstract: As the scale of high-performance computing (HPC) systems continues to grow, the mean time to failure (MTTF) of these systems tends to decrease. In order to efficiently run long computing jobs on these systems, handling system failures becomes a prime challenge. We present the design and implementation of an efficient runtime-level failure detection and propagation strategy, targeting large-scale dynamic systems, that is able to detect both node and process failures. Multiple overlapping topologies are used to optimize detection and propagation, minimizing the incurred overheads and guaranteeing the scalability of the entire framework. The resulting framework has been implemented in the context of a system-level runtime for parallel environments, the PMIx Reference RunTime Environment (PRRTE), providing efficient and scalable fault-management capabilities to a large range of programming and execution paradigms. The experimental evaluation of the resulting software stack on different machines demonstrates that the solution is both generic and efficient.
"Analysis of model parallelism for distributed neural networks"
Adrián Castelló, M. F. Dolz, E. S. Quintana‐Ortí, J. Duato
DOI: 10.1145/3343211.3343218

Abstract: We analyze the performance of model parallelism applied to the training of deep neural networks on clusters. For this study, we develop a parameterized analytical performance model that captures the main computational and communication stages in distributed model-parallel training. This model is then leveraged to assess the performance impact on four representative convolutional neural networks (CNNs) of varying the node throughput (in operations per second) and memory bandwidth, the number of cluster nodes, the bandwidth of the network links, and algorithmic parameters such as the batch dimension. As a second contribution, we discuss the need for specialized collective communication variants of the MPI_Allgather and MPI_Allreduce primitives in which the number of "contributing" processes differs from the number of processes receiving a copy or part of the result during training. Furthermore, we analyze the effect that the actual implementation of the algorithms underlying the collective communication primitives exerts on the performance of the distributed model-parallel realization of the selected CNNs.
"Exposition, clarification, and expansion of MPI semantic terms and conventions: is a nonblocking MPI function permitted to block?"
P. Bangalore, R. Rabenseifner, Daniel J. Holmes, Julien Jaeger, Guillaume Mercier, C. Blaas-Schenner, A. Skjellum
DOI: 10.1145/3343211.3343213

Abstract: This paper offers a timely study of, and proposed clarifications, revisions, and enhancements to, the Message Passing Interface's (MPI's) semantic terms and conventions. To enhance MPI, a clearer understanding of the meaning of the key terminology has proven essential and, surprisingly, important concepts remain underspecified, ambiguous, and in some cases inconsistent or conflicting, despite 26 years of standardization. This work addresses these concerns comprehensively and usefully informs MPI developers, implementors, those teaching and learning MPI, and power users alike about key aspects of existing conventions, syntax, and semantics. This paper should also be a useful driver for greater clarity in current and future standardization and implementation efforts for MPI.
{"title":"Foreword EuroMPI 2019","authors":"J. Träff, T. Hoefler","doi":"10.1145/3343211.3343212","DOIUrl":"https://doi.org/10.1145/3343211.3343212","url":null,"abstract":"","PeriodicalId":314904,"journal":{"name":"Proceedings of the 26th European MPI Users' Group Meeting","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124702367","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"QMPI: a next generation MPI profiling interface for modern HPC platforms","authors":"Bengisu Elis, Dai Yang, M. Schulz","doi":"10.1145/3343211.3343215","DOIUrl":"https://doi.org/10.1145/3343211.3343215","url":null,"abstract":"As we approach exascale and start planning for beyond, the rising complexity of systems and applications demands new monitoring, analysis, and optimization approaches. This requires close coordination with the parallel programming system used, which for HPC in most cases includes MPI, the Message Passing Interface. While MPI provides comprehensive tool support in the form of the MPI Profiling interface, PMPI, which has inspired a generation of tools, it is not sufficient for the new arising challenges. In particular, it does not support modern software design principles nor the composition of multiple monitoring solutions from multiple agents or sources. We approach these gaps and present QMPI, as a possible successor to PMPI. In this paper, we present the use cases and requirements that drive its development, offer a prototype design and implementation, and demonstrate its effectiveness and low overhead.","PeriodicalId":314904,"journal":{"name":"Proceedings of the 26th European MPI Users' Group Meeting","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126700990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"MPI tag matching performance on ConnectX and ARM"
W. P. Marts, Matthew G. F. Dosanjh, W. Schonbein, Ryan E. Grant, Patrick G. Bridges
DOI: 10.1145/3343211.3343224

Abstract: As we approach exascale, message matching has increasingly become a significant factor in HPC application performance. To address this, network vendors have placed higher precedence on improving MPI message matching performance. ConnectX-5, Mellanox's new network interface card, has both hardware and software matching layers. The performance characteristics of these layers have yet to be studied under real-world circumstances. In this work we offer an initial evaluation of ConnectX-5 message matching performance. To analyze this new hardware, we executed a series of micro-benchmarks and applications on Astra, an ARM-based ConnectX-5 HPC system, while varying hardware and software matching parameters. The benchmark results show that the ConnectX-5 is sensitive to queue depths, and that hardware message matching increases performance for applications that send messages between 1 KiB and 16 KiB. Furthermore, the hardware matching system was capable of matching wildcard receives without negatively impacting performance. Finally, for some applications, a significant improvement can be observed when leveraging the ConnectX-5's hardware matching.