Proceedings of the 26th European MPI Users' Group Meeting: Latest Articles

Mixing ranks, tasks, progress and nonblocking collectives
Proceedings of the 26th European MPI Users' Group Meeting Pub Date : 2019-09-11 DOI: 10.1145/3343211.3343221
Jean-Baptiste Besnard, Julien Jaeger, A. Malony, S. Shende, Hugo Taboada, Marc Pérache, Patrick Carribault
Abstract: Since the beginning, MPI has defined the rank as an implicit attribute associated with the MPI process' environment. In particular, each MPI process generally runs inside a given UNIX process and is associated with a fixed identifier in its WORLD communicator. However, this state of things is about to change with the rise of new abstractions such as MPI Sessions. In this paper, we propose to outline how such evolution could enable optimizations which were previously linked to specific MPI runtimes executing MPI processes in shared memory (e.g. thread-based MPI). By implementing runtime-level work-sharing through what we define as MPI tasks, enabling the ability to progress indifferently from stream context we show that there is potential for improved asynchronous progress. In the absence of a Session implementation, this assumption is validated in the context of a thread-based MPI where nonblocking Collective (NBC) were implemented on top of Extended Generic Requests progressed by any rank on the node thanks to an MPI extension enabling threads to dynamically share their MPI context.
Citations: 0
A performance analysis and optimization of PMIx-based HPC software stacks
Proceedings of the 26th European MPI Users' Group Meeting Pub Date : 2019-09-11 DOI: 10.1145/3343211.3343220
A. Y. Polyakov, Boris I. Karasev, Joshua Hursey, Joshua Ladd, Mikhail Brinskii, E. Shipunova
Abstract: Process management libraries and runtime environments serve an important role in the HPC application lifecycle. This work provides a roadmap for implementing a high-performance PMIx based software stacks and targets four performance-critical areas presenting novel codesigned solutions that significantly improve application performance during initialization and wire-up at scale. First, the new locking and thread-safety schemes of the PMIx on-host communication are designed demonstrating up to 66x reduction in PMIx_Get latency. Second, the optimizations of protocols involved in the wire-up procedure are proposed. Specific improvements in the UCX endpoint address representation, the layout of PMIx metadata, and the use of Little-Endian Base 128 encoding decreased the volume of inter-node data exchanged by up to 8.6x. Third, a modification of the Bruck concatenation algorithm is presented that scales better than ring- and tree-based implementations currently used in resource managers for PMIx data exchange. Lastly, an out-of-band channel leveraging the high-performance fabric is evaluated demonstrating orders of magnitude performance improvement compared to the existing implementation.
Citations: 5
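The PMIx abstract above credits part of its data-volume reduction to Little-Endian Base 128 (LEB128) encoding. As a rough illustration of why that encoding shrinks small integers, here is a minimal sketch of generic unsigned LEB128 in Python. The function names `leb128_encode` and `leb128_decode` are hypothetical, and nothing here reflects how PMIx or UCX actually lay out their metadata.

```python
from typing import Tuple


def leb128_encode(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128 (7 payload bits per byte)."""
    if value < 0:
        raise ValueError("unsigned LEB128 cannot encode negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F          # take the low 7 bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: last byte
            return bytes(out)


def leb128_decode(data: bytes) -> Tuple[int, int]:
    """Decode one unsigned LEB128 value; return (value, bytes_consumed)."""
    result = 0
    shift = 0
    for index, byte in enumerate(data):
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, index + 1
        shift += 7
    raise ValueError("truncated LEB128 sequence")
```

Values below 128 fit in a single byte instead of a fixed 4 or 8, which is where the savings come from when most exchanged integers (ranks, sizes, offsets) are small.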
Evaluating tradeoffs between MPI message matching offload hardware capacity and performance
Proceedings of the 26th European MPI Users' Group Meeting Pub Date : 2019-09-11 DOI: 10.1145/3343211.3343223
Scott Levy, Kurt B. Ferreira
Abstract: Although its demise has been frequently predicted, the Message Passing Interface (MPI) remains the dominant programming model for scientific applications running on high-performance computing (HPC) systems. MPI specifies powerful semantics for interprocess communication that have enabled scientists to write applications for simulating important physical phenomena. However, these semantics have also presented several significant challenges. For example, the existence of wildcard values has made the efficient enforcement of MPI message matching semantics challenging. Significant research has been dedicated to accelerating MPI message matching. One common approach has been to offload matching to dedicated hardware. One of the challenges that hardware designers have faced is knowing how to size hardware structures to accommodate outstanding match requests. Applications that exceed the capacity of specialized hardware typically must fall back to storing match requests in bulk memory, e.g. DRAM on the host processor. In this paper, we examine the implications of hardware matching and develop guidance on sizing hardware matching structure to strike a balance between minimizing expensive dedicated hardware resources and overall matching performance. By examining the message matching behavior of several important HPC workloads, we show that when specialized hardware matching is not dramatically faster than matching in memory the offload hardware's match queue capacity can be reduced without significantly increasing match time. On the other hand, effectively exploiting the benefits of very fast specialized matching hardware requires sufficient storage resources to ensure that every search completes in the specialized hardware. The data and analysis in this paper provide important guidance for designers of MPI message matching hardware.
Citations: 4
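The abstract above notes that wildcard values make MPI message matching hard to do efficiently. The toy model below sketches why: a posted receive may name a specific source and tag or a wildcard for either, and an incoming message must match the oldest compatible entry, which in this simple form means a linear search. The class `PostedReceiveQueue` and the constants `ANY_SOURCE` and `ANY_TAG` are hypothetical stand-ins. Communicators are left out, and this is not how any particular MPI library or offload NIC implements matching.

```python
from typing import List, Optional, Tuple

ANY_SOURCE = -1  # stand-in for MPI_ANY_SOURCE (assumed value, illustration only)
ANY_TAG = -1     # stand-in for MPI_ANY_TAG (assumed value, illustration only)


class PostedReceiveQueue:
    """Toy posted-receive queue searched in posting order."""

    def __init__(self) -> None:
        self._entries: List[Tuple[int, int, str]] = []  # (source, tag, label)

    def post(self, source: int, tag: int, label: str) -> None:
        """Append a receive request; earlier posts have matching priority."""
        self._entries.append((source, tag, label))

    def match(self, source: int, tag: int) -> Tuple[Optional[str], int]:
        """Match an incoming message; return (label or None, entries searched)."""
        for depth, (want_src, want_tag, label) in enumerate(self._entries):
            src_ok = want_src in (ANY_SOURCE, source)
            tag_ok = want_tag in (ANY_TAG, tag)
            if src_ok and tag_ok:
                del self._entries[depth]  # a receive is consumed by one message
                return label, depth + 1
        return None, len(self._entries)


def demo() -> List[Tuple[Optional[str], int]]:
    queue = PostedReceiveQueue()
    queue.post(0, 7, "recv-a")
    queue.post(ANY_SOURCE, 7, "recv-b")
    queue.post(1, ANY_TAG, "recv-c")
    return [queue.match(1, 7), queue.match(1, 9), queue.match(2, 5)]
```

The second element of each result is the search depth, the quantity that grows with queue length and that a capacity-limited hardware match unit has to accommodate.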
Minimizing the usage of hardware counters for collective communication using triggered operations
Proceedings of the 26th European MPI Users' Group Meeting Pub Date : 2019-09-11 DOI: 10.1145/3343211.3343222
Nusrat S. Islam, G. Zheng, S. Sur, Akhil Langer, M. Garzarán
Abstract: Triggered operations and counting events or counters are building blocks that can be used by communication libraries, such as MPI, to offload collective operations to the Host Fabric Interface (HFI) or Network Interface Card (NIC). Triggered operations can be used to schedule a network or arithmetic operation to occur in the future, when a trigger counter reaches a specified threshold. On completion of the operation, the value of a completion counter increases by one. With this mechanism, it is possible to create a chain of dependent operations, so that the execution of an operation is triggered when all its dependent operations have completed its execution. Triggered operations rely on hardware counters on the HFI and are a limited resource. Thus, if the number of required counters exceeds the number of hardware counters, a collective needs to stall until a previous collective completes and counters are released. In addition, if the HFI has a counter cache, utilizing a large number of counters can cause cache thrashing and provide poor performance. Therefore, it is important to reduce the number of counters, specially when running on a large supercomputer or when an application uses non-blocking collectives and multiple collectives can run concurrently. In this paper, we propose an algorithm to optimize the number of hardware counters used when offloading collectives with triggered operations. With our algorithm, different operations can share and re-use trigger and completion counters based on the dependences among them and their topological orderings. Our experimental results show that our proposed algorithm significantly reduces the number of counters over a default approach that does not consider the dependences among the operations.
Citations: 4
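The trigger-counter and completion-counter mechanism described in the abstract above can be mimicked in a few lines of software. The sketch below is a toy model only: the `Counter` class and its `arm` and `increment` methods are hypothetical names, not an HFI, NIC, or Portals API, and the counter-sharing algorithm proposed in the paper is not modelled. It shows a two-step dependency chain in which a "reduce" fires once two receives have completed and a "send" fires once the reduce has completed.

```python
from typing import Callable, List, Optional, Tuple


class Counter:
    """Toy counting event: fires armed operations once a threshold is reached."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.value = 0
        # Each pending entry: (threshold, action, completion counter or None)
        self._pending: List[Tuple[int, Callable[[], None], Optional["Counter"]]] = []

    def arm(self, threshold: int, action: Callable[[], None],
            completion: Optional["Counter"] = None) -> None:
        """Schedule `action` to run when this counter reaches `threshold`."""
        self._pending.append((threshold, action, completion))
        self._fire_ready()

    def increment(self) -> None:
        """Record one completed operation and fire anything now ready."""
        self.value += 1
        self._fire_ready()

    def _fire_ready(self) -> None:
        ready = [p for p in self._pending if p[0] <= self.value]
        self._pending = [p for p in self._pending if p[0] > self.value]
        for _, action, completion in ready:
            action()
            if completion is not None:
                completion.increment()  # completion bumps the next counter


def demo() -> Tuple[List[str], List[str]]:
    log: List[str] = []
    recv_done = Counter("recv_done")
    reduce_done = Counter("reduce_done")
    recv_done.arm(2, lambda: log.append("reduce"), completion=reduce_done)
    reduce_done.arm(1, lambda: log.append("send"))
    recv_done.increment()         # first receive completes; nothing fires yet
    after_first = list(log)
    recv_done.increment()         # second receive completes; the chain runs
    return after_first, list(log)
```

This chain already uses two counters for three logical steps. A collective with many such steps on hardware with few counters is the situation in which the abstract says sharing and reuse matter.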
Runtime level failure detection and propagation in HPC systems
Proceedings of the 26th European MPI Users' Group Meeting Pub Date : 2019-09-11 DOI: 10.1145/3343211.3343225
Dong Zhong, Aurélien Bouteiller, Xi Luo, G. Bosilca
Abstract: As the scale of high-performance computing (HPC) systems continues to grow, mean-time-to-failure (MTTF) of these HPC systems is negatively impacted and tends to decrease. In order to efficiently run long computing jobs on these systems, handling system failures becomes a prime challenge. We present here the design and implementation of an efficient runtime-level failure detection and propagation strategy targeting large-scale, dynamic systems that is able to detect both node and process failures. Multiple overlapping topologies are used to optimize the detection and propagation, minimizing the incurred overheads and guaranteeing the scalability of the entire framework. The resulting framework has been implemented in the context of a system-level runtime for parallel environments, PMIx Reference RunTime Environment (PRRTE), providing efficient and scalable capabilities of fault management to a large range of programming and execution paradigms. The experimental evaluation of the resulting software stack on different machines demonstrate that the solution is at the same time generic and efficient.
Citations: 9
Analysis of model parallelism for distributed neural networks
Proceedings of the 26th European MPI Users' Group Meeting Pub Date : 2019-09-11 DOI: 10.1145/3343211.3343218
Adrián Castelló, M. F. Dolz, E. S. Quintana‐Ortí, J. Duato
Abstract: We analyze the performance of model parallelism applied to the training of deep neural networks on clusters. For this study, we elaborate a parameterized analytical performance model that captures the main computational and communication stages in distributed model parallel training. This model is then leveraged to assess the impact on the performance of four representative convolutional neural networks (CNNs) when varying the node throughput in terms of operations per second and memory bandwidth, the number of nodes of the cluster, the bandwidth of the network links, and algorithmic parameters such as the dimension of the batch. As a second contribution of this paper, we discuss the need for specialized collective communication variants of the MPI_Allgather and MPI_Allreduce primitives where the number of "contributing" processes differs from the number of processes receiving a copy/part of the result during training. Furthermore, we analyze the effect that the actual implementation of the algorithms underlying the collective communication primitives exert on the performance of the distributed model parallel realization of the selected CNNs.
Citations: 10
Exposition, clarification, and expansion of MPI semantic terms and conventions: is a nonblocking MPI function permitted to block?
Proceedings of the 26th European MPI Users' Group Meeting Pub Date : 2019-09-11 DOI: 10.1145/3343211.3343213
P. Bangalore, R. Rabenseifner, Daniel J. Holmes, Julien Jaeger, Guillaume Mercier, C. Blaas-Schenner, A. Skjellum
Abstract: This paper offers a timely study and proposed clarifications, revisions, and enhancements to the Message Passing Interface's (MPI's) Semantic Terms and Conventions. To enhance MPI, a clearer understanding of the meaning of the key terminology has proven essential, and, surprisingly, important concepts remain underspecified, ambiguous and, in some cases, inconsistent and/or conflicting despite 26 years of standardization. This work addresses these concerns comprehensively and usefully informs MPI developers, implementors, those teaching and learning MPI, and power users alike about key aspects of existing conventions, syntax, and semantics. This paper will also be a useful driver for great clarity in current and future standardization and implementation efforts for MPI.
Citations: 8
Foreword EuroMPI 2019
Proceedings of the 26th European MPI Users' Group Meeting Pub Date : 2019-09-11 DOI: 10.1145/3343211.3343212
J. Träff, T. Hoefler
Citations: 1
QMPI: a next generation MPI profiling interface for modern HPC platforms
Proceedings of the 26th European MPI Users' Group Meeting Pub Date : 2019-09-11 DOI: 10.1145/3343211.3343215
Bengisu Elis, Dai Yang, M. Schulz
Abstract: As we approach exascale and start planning for beyond, the rising complexity of systems and applications demands new monitoring, analysis, and optimization approaches. This requires close coordination with the parallel programming system used, which for HPC in most cases includes MPI, the Message Passing Interface. While MPI provides comprehensive tool support in the form of the MPI Profiling interface, PMPI, which has inspired a generation of tools, it is not sufficient for the new arising challenges. In particular, it does not support modern software design principles nor the composition of multiple monitoring solutions from multiple agents or sources. We approach these gaps and present QMPI, as a possible successor to PMPI. In this paper, we present the use cases and requirements that drive its development, offer a prototype design and implementation, and demonstrate its effectiveness and low overhead.
Citations: 10
MPI tag matching performance on ConnectX and ARM
Proceedings of the 26th European MPI Users' Group Meeting Pub Date : 2019-09-11 DOI: 10.1145/3343211.3343224
W. P. Marts, Matthew G. F. Dosanjh, W. Schonbein, Ryan E. Grant, Patrick G. Bridges
Abstract: As we approach Exascale, message matching has increasingly become a significant factor in HPC application performance. To address this, network vendors have placed higher precedence on improving MPI message matching performance. ConnectX-5, Mellanox's new network interface card, has both hardware and software matching layers. The performance characteristics of these layers have yet to be studied under real world circumstances. In this work we offer an initial evaluation of ConnectX-5 message matching performance. To analyze this new hardware we executed a series of micro-benchmarks and applications on Astra, an ARM-based ConnectX-5 HPC system, while varying hardware and software matching parameters. The benchmark results show the ConnectX-5 is sensitive to queue depths, and that hardware message matching increases performance for applications that send messages between 1KiB and 16KiB. Furthermore, the hardware matching system was capable of matching wildcard receives without negatively impacting performance. Finally, for some applications, a significant improvement can be observed when leveraging the ConnectX-5's hardware matching.
Citations: 4