Enabling Efficient Multithreaded MPI Communication through a Library-Based Implementation of MPI Endpoints

Srinivas Sridharan, James Dinan, Dhiraj D. Kalamkar
{"title":"Enabling Efficient Multithreaded MPI Communication through a Library-Based Implementation of MPI Endpoints","authors":"Srinivas Sridharan, James Dinan, Dhiraj D. Kalamkar","doi":"10.1109/SC.2014.45","DOIUrl":null,"url":null,"abstract":"Modern high-speed interconnection networks are designed with capabilities to support communication from multiple processor cores. The MPI endpoints extension has been proposed to ease process and thread count tradeoffs by enabling multithreaded MPI applications to efficiently drive independent network communication. In this work, we present the first implementation of the MPI endpoints interface and demonstrate the first applications running on this new interface. We use a novel library-based design that can be layered on top of any existing, production MPI implementation. Our approach uses proxy processes to isolate threads in an MPI job, eliminating threading overheads within the MPI library and allowing threads to achieve process-like communication performance. We evaluate the performance advantages of our implementation through several benchmarks and kernels. Performance results for the Lattice QCD Dslash kernel indicate that endpoints provides up to 2.9× improvement in communication performance and 1.87× overall performance improvement over a highly optimized hybrid MPI+OpenMP baseline on 128 processors.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"14 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"36","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SC.2014.45","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 36

Abstract

Modern high-speed interconnection networks are designed with capabilities to support communication from multiple processor cores. The MPI endpoints extension has been proposed to ease process and thread count tradeoffs by enabling multithreaded MPI applications to efficiently drive independent network communication. In this work, we present the first implementation of the MPI endpoints interface and demonstrate the first applications running on this new interface. We use a novel library-based design that can be layered on top of any existing, production MPI implementation. Our approach uses proxy processes to isolate threads in an MPI job, eliminating threading overheads within the MPI library and allowing threads to achieve process-like communication performance. We evaluate the performance advantages of our implementation through several benchmarks and kernels. Performance results for the Lattice QCD Dslash kernel indicate that endpoints provides up to 2.9× improvement in communication performance and 1.87× overall performance improvement over a highly optimized hybrid MPI+OpenMP baseline on 128 processors.
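To make the programming model concrete, the sketch below shows how a multithreaded application might use the proposed endpoints interface. MPI_Comm_create_endpoints and its signature are taken from the endpoints extension proposal, not the MPI standard, and the per-thread usage pattern shown is an illustrative assumption rather than the paper's exact code.

    /* Sketch: per-thread communication through the *proposed* MPI endpoints
     * interface.  MPI_Comm_create_endpoints follows the endpoints extension
     * proposal and is not part of the MPI standard. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        int nthreads = omp_get_max_threads();
        MPI_Comm *ep_comms = malloc(nthreads * sizeof(MPI_Comm));

        /* Each MPI process requests one endpoint per thread; every endpoint
         * receives its own rank in the resulting endpoints communicator. */
        MPI_Comm_create_endpoints(MPI_COMM_WORLD, nthreads, MPI_INFO_NULL, ep_comms);

        #pragma omp parallel
        {
            MPI_Comm ep = ep_comms[omp_get_thread_num()];
            int ep_rank;
            MPI_Comm_rank(ep, &ep_rank);

            /* ... the thread now drives MPI communication on `ep` as if it
             * were an independent MPI process (e.g., halo exchanges) ... */

            MPI_Comm_free(&ep);
        }

        free(ep_comms);
        MPI_Finalize();
        return 0;
    }

In this model each thread holds its own endpoint communicator and hence its own rank, which is what allows the library-based implementation described here to back each endpoint with a proxy process and give threads process-like communication performance.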