"Improving the memory access locality of hybrid MPI applications"
M. Diener, Sam White, L. Kalé, M. T. Campbell, D. Bodony, J. Freund
DOI: 10.1145/3127024.3127038
Abstract: Maintaining memory access locality is a continuing challenge for parallel applications and their runtime environments. By exploiting locality, application performance, resource usage, and performance portability can be improved. The main challenge is to detect and fix memory locality issues in applications that use shared-memory programming models for intra-node parallelization. In this paper, we investigate improving the memory access locality of a hybrid MPI+OpenMP application in two ways: by manually fixing locality issues in its source code and by employing the Adaptive MPI (AMPI) runtime environment. Results show that AMPI can deliver locality improvements similar to those from manual source code changes, leading to substantial performance and scalability gains compared to the unoptimized version and to a pure MPI runtime. Compared to the hybrid MPI+OpenMP baseline, our optimizations improved performance by 1.8x on a single cluster node and by 1.4x on 32 nodes, with a speedup of 2.4x compared to a pure MPI execution on 32 nodes. In addition to performance, we also evaluate the impact of memory locality on the load balance within a node.
"MPI performance engineering with the MPI tool interface: the integration of MVAPICH and TAU"
Srinivasan Ramesh, Aurèle Mahéo, S. Shende, A. Malony, H. Subramoni, D. Panda
DOI: 10.1145/3127024.3127036
Abstract: MPI implementations are becoming increasingly complex and highly tunable, and thus scalability limitations can come from numerous sources. The MPI Tools Interface (MPI_T), introduced as part of the MPI 3.0 standard, provides an opportunity for performance tools and external software to introspect and understand MPI runtime behavior at a deeper level to detect scalability issues. The interface also provides a mechanism to re-configure the MPI library dynamically at runtime to fine-tune performance. In this paper, we propose an infrastructure that extends existing components (TAU, MVAPICH2, and BEACON) to take advantage of the MPI_T interface and offer runtime introspection, online monitoring, recommendation generation, and autotuning capabilities. We validate our design by developing optimizations for a combination of production and synthetic applications. We use our infrastructure to implement an autotuning policy for AmberMD [1] that monitors and reduces the internal memory footprint of the MVAPICH2 library by 20% without affecting performance. For applications where collective communication is latency sensitive, such as MiniAMR [2], our infrastructure is able to generate recommendations to enable hardware offloading of collectives supported by MVAPICH2. By implementing this recommendation, we see a 5% improvement in application runtime.
"MPI windows on storage for HPC applications"
Sergio Rivas-Gomez, R. Gioiosa, I. Peng, Gokcen Kestor, Sai B. Narasimhamurthy, E. Laure, S. Markidis
DOI: 10.1145/3127024.3127034
Abstract: Upcoming HPC clusters will feature hybrid memories and storage devices per compute node. In this work, we propose to use the MPI one-sided communication model and MPI windows as a single interface for programming both memory and storage. We describe the design and implementation of MPI windows on storage and present its benefits for out-of-core execution, parallel I/O, and fault tolerance. Using a modified STREAM micro-benchmark, we measure the sustained bandwidth of MPI windows on storage against MPI memory windows and observe that only a 10% performance penalty is incurred. When using parallel file systems such as Lustre, asymmetric performance is observed, with a 10% penalty for read operations and a 90% penalty for write operations. Nonetheless, experimental results with a distributed hash table and the HACC I/O kernel mini-application show that the overall penalty of MPI windows on storage can be negligible in most cases for real-world applications.
"Using software-based performance counters to expose low-level Open MPI performance information"
David Eberius, Thananon Patinyasakdikul, G. Bosilca
DOI: 10.1145/3127024.3127039
Abstract: This paper details the implementation and usage of software-based performance counters to understand the performance of a particular implementation of the MPI standard, Open MPI. Such counters can expose intrinsic features of the software stack that are not otherwise available in a generic and portable way. The PMPI interface is useful for instrumenting MPI applications at the user level; however, it is insufficient for providing meaningful internal MPI performance details. While the Peruse interface provides more detailed information on state changes within Open MPI, it has not seen widespread adoption. We introduce a simple low-level approach that instruments the Open MPI code at key locations to provide fine-grained MPI performance metrics. We evaluate the overhead of adding these counters to Open MPI, as well as their use in identifying bottlenecks and areas for improvement both in user code and in the MPI implementation itself.
{"title":"Verification of MPI programs using CIVL","authors":"Ziqing Luo, Manchun Zheng, Stephen F. Siegel","doi":"10.1145/3127024.3127032","DOIUrl":"https://doi.org/10.1145/3127024.3127032","url":null,"abstract":"CIVL is a framework for verifying concurrent programs. The framework is built around a language, CIVL-C, that extends sequential C with general-purpose primitives that can be used to model a variety of concurrency dialects, including OpenMP, Pthreads, CUDA, and MPI. The framework automatically transforms programs using those dialects into CIVL-C so that static analysis and verification tools for CIVL-C can be applied. This paper describes how C/MPI programs are so transformed. The result is a verifier that can check, within finite bounds, a number of difficult properties of MPI programs, including functional correctness, deadlock-freedom, and adherence to rules specified in the MPI Standard.","PeriodicalId":118516,"journal":{"name":"Proceedings of the 24th European MPI Users' Group Meeting","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127116409","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"What does fault tolerant deep learning need from MPI?"
Vinay C. Amatya, Abhinav Vishnu, C. Siegel, J. Daily
DOI: 10.1145/3127024.3127037
Abstract: Deep Learning (DL) algorithms have become the de facto Machine Learning (ML) algorithms for large scale data analysis. DL algorithms are computationally expensive: even distributed DL implementations that use MPI require days of training (model learning) time on commonly studied datasets. Long running DL applications thus become susceptible to faults, requiring the development of a fault tolerant system infrastructure in addition to fault tolerant DL algorithms. This raises an important question: what is needed from MPI for designing fault tolerant DL implementations? In this paper, we address this problem for permanent faults. We motivate the need for a fault tolerant MPI specification through an in-depth consideration of recent innovations in DL algorithms and their properties, which drive the need for specific fault tolerance features. We present an in-depth discussion of the suitability of different parallelism types (model, data, and hybrid); the need (or lack thereof) for checkpointing of critical data structures; and, most importantly, several fault tolerance proposals in MPI (user-level fault mitigation (ULFM) and Reinit) and their applicability to fault tolerant DL implementations. We leverage a distributed memory implementation of Caffe, currently available in the Machine Learning Toolkit for Extreme Scale (MaTEx), and extend MaTEx-Caffe to use a ULFM-based implementation. Our evaluation using the ImageNet dataset and the AlexNet and GoogLeNet neural network topologies demonstrates the effectiveness of the proposed fault tolerant DL implementation using Open MPI-based ULFM.
{"title":"Practical, linear-time, fully distributed algorithms for irregular gather and scatter","authors":"J. Träff","doi":"10.1145/3127024.3127025","DOIUrl":"https://doi.org/10.1145/3127024.3127025","url":null,"abstract":"We present new, simple, fully distributed, practical algorithms with linear time communication cost for irregular gather and scatter operations in which processors contribute or consume possibly different amounts of data. In a homogeneous, linear cost transmission model with start-up latency α and cost per unit β, the new algorithms take time 3⌈log2p⌉α + β Σi≠r mi where p is the number of processors, mi the amount of data for processor i, 0 ≤ i < p, and processor r, 0 ≤ r < p a root processor determined by the algorithm. With a fixed, externally given root processor r, there is an additive time penalty of at most β(Md' − mrd' − Σ0≤j<d' Mj) for some d' < ⌈log2 p⌉, where each Mj is the total amount of data in a tree of 2j different processors with roots rj as constructed by the algorithm. The worst-case time penalty is less than β Σi≠r mi. The algorithms have attractive properties for implementing the operations for MPI (the Message-Passing Interface). Standard algorithms using fixed trees take time either ⌈log2 p⌉(α + β Σi≠r mi) in the worst case, or (p − 1)α + Σi≠r βmi. We have used the new algorithms to give prototype implementations for the MPI_Gatherv and MPI_Scatterv collectives of MPI, and present benchmark results from a small and a medium-large InfiniBand cluster. In order to structure the experimental evaluation we formulate new performance guidelines for irregular collectives that can be used to assess the performance in relation to the corresponding regular collectives. We show that the new algorithms can fulfill these performance expectations within a large margin, and that standard implementations do not.","PeriodicalId":118516,"journal":{"name":"Proceedings of the 24th European MPI Users' Group Meeting","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124289542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}