Why Is MPI So Slow? Analyzing the Fundamental Limits in Implementing MPI-3.1
Kenneth Raffenetti, A. Amer, Lena Oden, C. Archer, Wesley Bland, H. Fujita, Yanfei Guo, T. Janjusic, D. Durnov, M. Blocksome, Min Si, Sangmin Seo, Akhil Langer, G. Zheng, Masamichi Takagi, Paul K. Coffman, Jithin Jose, S. Sur, Alexander Sannikov, Sergey Oblomov, Michael Chuvelev, M. Hatanaka, Xin Zhao, P. Fischer, T. Rathnayake, M. Otten, M. Min, P. Balaji
SC17: International Conference for High Performance Computing, Networking, Storage and Analysis, November 12, 2017
DOI: 10.1145/3126908.3126963
Citations: 24
Abstract
This paper provides an in-depth analysis of the software overheads on the MPI performance-critical path and exposes mandatory overheads that cannot be avoided under the MPI-3.1 specification. We first present a highly optimized implementation of the MPI-3.1 standard in which the communication stack, all the way from the application down to the low-level network communication API, takes only a few tens of instructions. We carefully study these instructions and trace the root causes of the overheads to specific requirements of the MPI standard that are unavoidable under the current specification. We recommend potential changes to the MPI standard that can minimize these overheads. Our experimental results on a variety of network architectures and applications demonstrate significant benefits from the proposed changes.

CCS Concepts: • Computing methodologies → Concurrent algorithms; Massively parallel algorithms
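To make the notion of software overhead on the performance-critical path concrete, the sketch below shows a standard MPI ping-pong latency microbenchmark of the kind commonly used for such measurements. It is not taken from the paper; the message size and iteration count are arbitrary illustrative choices.

```c
/*
 * Minimal ping-pong latency sketch (illustrative, not from the paper).
 * For small messages, the measured one-way latency is dominated by
 * per-message software overhead in the MPI library and network stack.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int iters = 100000;   /* number of round trips (illustrative) */
    char buf[8] = {0};          /* small message to expose software overhead */

    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double t1 = MPI_Wtime();

    if (rank == 0) {
        /* one-way latency = half the average round-trip time */
        printf("avg one-way latency: %.3f us\n",
               (t1 - t0) / (2.0 * iters) * 1e6);
    }

    MPI_Finalize();
    return 0;
}
```

For small messages, almost all of the reported one-way latency comes from the instructions executed per message inside the MPI library and the low-level network API, which is the cost the paper studies instruction by instruction.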