"Mixing ranks, tasks, progress and nonblocking collectives"
Jean-Baptiste Besnard, Julien Jaeger, A. Malony, S. Shende, Hugo Taboada, Marc Pérache, Patrick Carribault
DOI: 10.1145/3343211.3343221

Abstract: Since the beginning, MPI has defined the rank as an implicit attribute associated with the MPI process's environment. In particular, each MPI process generally runs inside a given UNIX process and is associated with a fixed identifier in its WORLD communicator. However, this state of affairs is about to change with the rise of new abstractions such as MPI Sessions. In this paper, we outline how this evolution could enable optimizations that were previously tied to specific MPI runtimes executing MPI processes in shared memory (e.g., thread-based MPI). By implementing runtime-level work-sharing through what we define as MPI tasks, which can make progress independently of their stream context, we show that there is potential for improved asynchronous progress. In the absence of a Sessions implementation, this assumption is validated in the context of a thread-based MPI where nonblocking collectives (NBCs) were implemented on top of Extended Generic Requests progressed by any rank on the node, thanks to an MPI extension enabling threads to dynamically share their MPI context.
"A performance analysis and optimization of PMIx-based HPC software stacks"
A. Y. Polyakov, Boris I. Karasev, Joshua Hursey, Joshua Ladd, Mikhail Brinskii, E. Shipunova
DOI: 10.1145/3343211.3343220

Abstract: Process management libraries and runtime environments serve an important role in the HPC application lifecycle. This work provides a roadmap for implementing high-performance PMIx-based software stacks and targets four performance-critical areas, presenting novel co-designed solutions that significantly improve application performance during initialization and wire-up at scale. First, new locking and thread-safety schemes for PMIx on-host communication are designed, demonstrating up to a 66x reduction in PMIx_Get latency. Second, optimizations of the protocols involved in the wire-up procedure are proposed: specific improvements to the UCX endpoint address representation, the layout of PMIx metadata, and the use of Little-Endian Base 128 (LEB128) encoding decreased the volume of inter-node data exchanged by up to 8.6x. Third, a modification of the Bruck concatenation algorithm is presented that scales better than the ring- and tree-based implementations currently used in resource managers for PMIx data exchange. Lastly, an out-of-band channel leveraging the high-performance fabric is evaluated, demonstrating orders-of-magnitude performance improvements compared to the existing implementation.
{"title":"Evaluating tradeoffs between MPI message matching offload hardware capacity and performance","authors":"Scott Levy, Kurt B. Ferreira","doi":"10.1145/3343211.3343223","DOIUrl":"https://doi.org/10.1145/3343211.3343223","url":null,"abstract":"Although its demise has been frequently predicted, the Message Passing Interface (MPI) remains the dominant programming model for scientific applications running on high-performance computing (HPC) systems. MPI specifies powerful semantics for interprocess communication that have enabled scientists to write applications for simulating important physical phenomena. However, these semantics have also presented several significant challenges. For example, the existence of wildcard values has made the efficient enforcement of MPI message matching semantics challenging. Significant research has been dedicated to accelerating MPI message matching. One common approach has been to offload matching to dedicated hardware. One of the challenges that hardware designers have faced is knowing how to size hardware structures to accommodate outstanding match requests. Applications that exceed the capacity of specialized hardware typically must fall back to storing match requests in bulk memory, e.g. DRAM on the host processor. In this paper, we examine the implications of hardware matching and develop guidance on sizing hardware matching structure to strike a balance between minimizing expensive dedicated hardware resources and overall matching performance. By examining the message matching behavior of several important HPC workloads, we show that when specialized hardware matching is not dramatically faster than matching in memory the offload hardware's match queue capacity can be reduced without significantly increasing match time. On the other hand, effectively exploiting the benefits of very fast specialized matching hardware requires sufficient storage resources to ensure that every search completes in the specialized hardware. The data and analysis in this paper provide important guidance for designers of MPI message matching hardware.","PeriodicalId":314904,"journal":{"name":"Proceedings of the 26th European MPI Users' Group Meeting","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115060030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Minimizing the usage of hardware counters for collective communication using triggered operations"
Nusrat S. Islam, G. Zheng, S. Sur, Akhil Langer, M. Garzarán
DOI: 10.1145/3343211.3343222

Abstract: Triggered operations and counting events (counters) are building blocks that communication libraries such as MPI can use to offload collective operations to the Host Fabric Interface (HFI) or Network Interface Card (NIC). A triggered operation schedules a network or arithmetic operation to occur in the future, when a trigger counter reaches a specified threshold; on completion of the operation, a completion counter is incremented by one. With this mechanism, it is possible to create a chain of dependent operations, so that an operation executes only once all the operations it depends on have completed. Triggered operations rely on hardware counters on the HFI, which are a limited resource. Thus, if the number of required counters exceeds the number of hardware counters, a collective must stall until a previous collective completes and counters are released. In addition, if the HFI has a counter cache, utilizing a large number of counters can cause cache thrashing and poor performance. It is therefore important to reduce the number of counters, especially when running on a large supercomputer or when an application uses nonblocking collectives and multiple collectives can run concurrently. In this paper, we propose an algorithm to optimize the number of hardware counters used when offloading collectives with triggered operations. With our algorithm, different operations can share and reuse trigger and completion counters based on the dependences among them and their topological orderings. Our experimental results show that the proposed algorithm significantly reduces the number of counters compared to a default approach that does not consider the dependences among operations.
"Runtime level failure detection and propagation in HPC systems"
Dong Zhong, Aurélien Bouteiller, Xi Luo, G. Bosilca
DOI: 10.1145/3343211.3343225

Abstract: As the scale of high-performance computing (HPC) systems continues to grow, the mean time to failure (MTTF) of these systems tends to decrease. In order to efficiently run long computing jobs on these systems, handling system failures becomes a prime challenge. We present the design and implementation of an efficient runtime-level failure detection and propagation strategy, targeting large-scale dynamic systems, that is able to detect both node and process failures. Multiple overlapping topologies are used to optimize detection and propagation, minimizing the incurred overheads and guaranteeing the scalability of the entire framework. The resulting framework has been implemented in the context of a system-level runtime for parallel environments, the PMIx Reference RunTime Environment (PRRTE), providing efficient and scalable fault-management capabilities to a large range of programming and execution paradigms. The experimental evaluation of the resulting software stack on different machines demonstrates that the solution is both generic and efficient.
"Analysis of model parallelism for distributed neural networks"
Adrián Castelló, M. F. Dolz, E. S. Quintana‐Ortí, J. Duato
DOI: 10.1145/3343211.3343218

Abstract: We analyze the performance of model parallelism applied to the training of deep neural networks on clusters. For this study, we develop a parameterized analytical performance model that captures the main computational and communication stages in distributed model-parallel training. This model is then leveraged to assess the performance impact on four representative convolutional neural networks (CNNs) of varying the node throughput (in operations per second) and memory bandwidth, the number of cluster nodes, the bandwidth of the network links, and algorithmic parameters such as the batch dimension. As a second contribution, we discuss the need for specialized collective communication variants of the MPI_Allgather and MPI_Allreduce primitives in which the number of "contributing" processes differs from the number of processes receiving a copy or part of the result during training. Furthermore, we analyze the effect that the actual implementation of the algorithms underlying the collective communication primitives exerts on the performance of the distributed model-parallel realization of the selected CNNs.
"Exposition, clarification, and expansion of MPI semantic terms and conventions: is a nonblocking MPI function permitted to block?"
P. Bangalore, R. Rabenseifner, Daniel J. Holmes, Julien Jaeger, Guillaume Mercier, C. Blaas-Schenner, A. Skjellum
DOI: 10.1145/3343211.3343213

Abstract: This paper offers a timely study of, and proposed clarifications, revisions, and enhancements to, the Message Passing Interface's (MPI's) semantic terms and conventions. To enhance MPI, a clearer understanding of the meaning of the key terminology has proven essential and, surprisingly, important concepts remain underspecified, ambiguous, and in some cases inconsistent or conflicting, despite 26 years of standardization. This work addresses these concerns comprehensively and usefully informs MPI developers, implementors, those teaching and learning MPI, and power users alike about key aspects of existing conventions, syntax, and semantics. This paper should also be a useful driver for greater clarity in current and future standardization and implementation efforts for MPI.
{"title":"Foreword EuroMPI 2019","authors":"J. Träff, T. Hoefler","doi":"10.1145/3343211.3343212","DOIUrl":"https://doi.org/10.1145/3343211.3343212","url":null,"abstract":"","PeriodicalId":314904,"journal":{"name":"Proceedings of the 26th European MPI Users' Group Meeting","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124702367","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"QMPI: a next generation MPI profiling interface for modern HPC platforms","authors":"Bengisu Elis, Dai Yang, M. Schulz","doi":"10.1145/3343211.3343215","DOIUrl":"https://doi.org/10.1145/3343211.3343215","url":null,"abstract":"As we approach exascale and start planning for beyond, the rising complexity of systems and applications demands new monitoring, analysis, and optimization approaches. This requires close coordination with the parallel programming system used, which for HPC in most cases includes MPI, the Message Passing Interface. While MPI provides comprehensive tool support in the form of the MPI Profiling interface, PMPI, which has inspired a generation of tools, it is not sufficient for the new arising challenges. In particular, it does not support modern software design principles nor the composition of multiple monitoring solutions from multiple agents or sources. We approach these gaps and present QMPI, as a possible successor to PMPI. In this paper, we present the use cases and requirements that drive its development, offer a prototype design and implementation, and demonstrate its effectiveness and low overhead.","PeriodicalId":314904,"journal":{"name":"Proceedings of the 26th European MPI Users' Group Meeting","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126700990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"MPI tag matching performance on ConnectX and ARM"
W. P. Marts, Matthew G. F. Dosanjh, W. Schonbein, Ryan E. Grant, Patrick G. Bridges
DOI: 10.1145/3343211.3343224

Abstract: As we approach exascale, message matching has increasingly become a significant factor in HPC application performance. To address this, network vendors have placed higher precedence on improving MPI message matching performance. ConnectX-5, Mellanox's new network interface card, has both hardware and software matching layers. The performance characteristics of these layers have yet to be studied under real-world circumstances. In this work we offer an initial evaluation of ConnectX-5 message matching performance. To analyze this new hardware, we executed a series of micro-benchmarks and applications on Astra, an ARM-based ConnectX-5 HPC system, while varying hardware and software matching parameters. The benchmark results show that the ConnectX-5 is sensitive to queue depths, and that hardware message matching increases performance for applications that send messages between 1 KiB and 16 KiB. Furthermore, the hardware matching system was capable of matching wildcard receives without negatively impacting performance. Finally, for some applications, a significant improvement can be observed when leveraging the ConnectX-5's hardware matching.