Why Is MPI So Slow? Analyzing the Fundamental Limits in Implementing MPI-3.1
Kenneth Raffenetti, A. Amer, Lena Oden, C. Archer, Wesley Bland, H. Fujita, Yanfei Guo, T. Janjusic, D. Durnov, M. Blocksome, Min Si, Sangmin Seo, Akhil Langer, G. Zheng, Masamichi Takagi, Paul K. Coffman, Jithin Jose, S. Sur, Alexander Sannikov, Sergey Oblomov, Michael Chuvelev, M. Hatanaka, Xin Zhao, P. Fischer, T. Rathnayake, M. Otten, M. Min, P. Balaji
SC17: International Conference for High Performance Computing, Networking, Storage and Analysis, November 12, 2017
DOI: 10.1145/3126908.3126963
Citations: 24
Abstract
This paper provides an in-depth analysis of the software overheads on the MPI performance-critical path and exposes mandatory overheads that cannot be avoided under the MPI-3.1 specification. We first present a highly optimized implementation of the MPI-3.1 standard in which the communication stack, all the way from the application down to the low-level network communication API, takes only a few tens of instructions. We carefully study these instructions and trace the root causes of the overheads to specific requirements of the MPI standard that are unavoidable under the current specification. We recommend potential changes to the MPI standard that can minimize these overheads. Our experimental results on a variety of network architectures and applications demonstrate significant benefits from the proposed changes.

CCS Concepts: • Computing methodologies → Concurrent algorithms; Massively parallel algorithms
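To make the notion of software overhead on the performance-critical path concrete, the sketch below shows a standard MPI ping-pong latency microbenchmark of the kind commonly used for such measurements. It is not taken from the paper; the message size and iteration count are arbitrary illustrative choices.

```c
/*
 * Minimal ping-pong latency sketch (illustrative, not from the paper).
 * For small messages, the measured one-way latency is dominated by
 * per-message software overhead in the MPI library and network stack.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int iters = 100000;   /* number of round trips (illustrative) */
    char buf[8] = {0};          /* small message to expose software overhead */

    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double t1 = MPI_Wtime();

    if (rank == 0) {
        /* one-way latency = half the average round-trip time */
        printf("avg one-way latency: %.3f us\n",
               (t1 - t0) / (2.0 * iters) * 1e6);
    }

    MPI_Finalize();
    return 0;
}
```

For small messages, almost all of the reported one-way latency comes from the instructions executed per message inside the MPI library and the low-level network API, which is the cost the paper studies instruction by instruction.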