"Improving the memory access locality of hybrid MPI applications"
M. Diener, Sam White, L. Kalé, M. T. Campbell, D. Bodony, J. Freund
DOI: 10.1145/3127024.3127038
Abstract: Maintaining memory access locality is a continuing challenge for parallel applications and their runtime environments. By exploiting locality, application performance, resource usage, and performance portability can be improved. The main challenge is to detect and fix memory locality issues in applications that use shared-memory programming models for intra-node parallelization. In this paper, we investigate improving the memory access locality of a hybrid MPI+OpenMP application in two ways: by manually fixing locality issues in its source code and by employing the Adaptive MPI (AMPI) runtime environment. Results show that AMPI can deliver locality improvements similar to those from manual source code changes, leading to substantial performance and scalability gains compared to the unoptimized version and to a pure MPI runtime. Compared to the hybrid MPI+OpenMP baseline, our optimizations improved performance by 1.8x on a single cluster node and by 1.4x on 32 nodes, with a speedup of 2.4x compared to a pure MPI execution on 32 nodes. In addition to performance, we also evaluate the impact of memory locality on the load balance within a node.
"MPI performance engineering with the MPI tool interface: the integration of MVAPICH and TAU"
Srinivasan Ramesh, Aurèle Mahéo, S. Shende, A. Malony, H. Subramoni, D. Panda
DOI: 10.1145/3127024.3127036
Abstract: MPI implementations are becoming increasingly complex and highly tunable, and thus scalability limitations can come from numerous sources. The MPI Tools Interface (MPI_T), introduced as part of the MPI 3.0 standard, provides an opportunity for performance tools and external software to introspect and understand MPI runtime behavior at a deeper level to detect scalability issues. The interface also provides a mechanism to re-configure the MPI library dynamically at runtime to fine-tune performance. In this paper, we propose an infrastructure that extends existing components (TAU, MVAPICH2, and BEACON) to take advantage of the MPI_T interface and offer runtime introspection, online monitoring, recommendation generation, and autotuning capabilities. We validate our design by developing optimizations for a combination of production and synthetic applications. We use our infrastructure to implement an autotuning policy for AmberMD [1] that monitors and reduces the internal memory footprint of the MVAPICH2 library by 20% without affecting performance. For applications where collective communication is latency sensitive, such as MiniAMR [2], our infrastructure is able to generate recommendations to enable hardware offloading of collectives supported by MVAPICH2. By implementing this recommendation, we see a 5% improvement in application runtime.
"MPI windows on storage for HPC applications"
Sergio Rivas-Gomez, R. Gioiosa, I. Peng, Gokcen Kestor, Sai B. Narasimhamurthy, E. Laure, S. Markidis
DOI: 10.1145/3127024.3127034
Abstract: Upcoming HPC clusters will feature hybrid memories and storage devices per compute node. In this work, we propose to use the MPI one-sided communication model and MPI windows as a single interface for programming both memory and storage. We describe the design and implementation of MPI windows on storage and present its benefits for out-of-core execution, parallel I/O, and fault tolerance. Using a modified STREAM micro-benchmark, we measure the sustained bandwidth of MPI windows on storage against MPI memory windows and observe that only a 10% performance penalty is incurred. When using parallel file systems such as Lustre, asymmetric performance is observed, with a 10% penalty for read operations and a 90% penalty for write operations. Nonetheless, experimental results with a distributed hash table and the HACC I/O kernel mini-application show that the overall penalty of MPI windows on storage can be negligible in most cases for real-world applications.
"Using software-based performance counters to expose low-level Open MPI performance information"
David Eberius, Thananon Patinyasakdikul, G. Bosilca
DOI: 10.1145/3127024.3127039
Abstract: This paper details the implementation and usage of software-based performance counters to understand the performance of a particular implementation of the MPI standard, Open MPI. Such counters can expose intrinsic features of the software stack that are not otherwise available in a generic and portable way. The PMPI interface is useful for instrumenting MPI applications at the user level; however, it is insufficient for providing meaningful internal MPI performance details. While the Peruse interface provides more detailed information on state changes within Open MPI, it has not seen widespread adoption. We introduce a simple low-level approach that instruments the Open MPI code at key locations to provide fine-grained MPI performance metrics. We evaluate the overhead of adding these counters to Open MPI, as well as their use in identifying bottlenecks and areas for improvement both in user code and in the MPI implementation itself.
{"title":"Verification of MPI programs using CIVL","authors":"Ziqing Luo, Manchun Zheng, Stephen F. Siegel","doi":"10.1145/3127024.3127032","DOIUrl":"https://doi.org/10.1145/3127024.3127032","url":null,"abstract":"CIVL is a framework for verifying concurrent programs. The framework is built around a language, CIVL-C, that extends sequential C with general-purpose primitives that can be used to model a variety of concurrency dialects, including OpenMP, Pthreads, CUDA, and MPI. The framework automatically transforms programs using those dialects into CIVL-C so that static analysis and verification tools for CIVL-C can be applied. This paper describes how C/MPI programs are so transformed. The result is a verifier that can check, within finite bounds, a number of difficult properties of MPI programs, including functional correctness, deadlock-freedom, and adherence to rules specified in the MPI Standard.","PeriodicalId":118516,"journal":{"name":"Proceedings of the 24th European MPI Users' Group Meeting","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127116409","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"What does fault tolerant deep learning need from MPI?"
Vinay C. Amatya, Abhinav Vishnu, C. Siegel, J. Daily
DOI: 10.1145/3127024.3127037
Abstract: Deep Learning (DL) algorithms have become the de facto Machine Learning (ML) algorithms for large scale data analysis. DL algorithms are computationally expensive: even distributed DL implementations that use MPI require days of training (model learning) time on commonly studied datasets. Long running DL applications thus become susceptible to faults, requiring the development of a fault tolerant system infrastructure in addition to fault tolerant DL algorithms. This raises an important question: what is needed from MPI for designing fault tolerant DL implementations? In this paper, we address this problem for permanent faults. We motivate the need for a fault tolerant MPI specification through an in-depth consideration of recent innovations in DL algorithms and their properties, which drive the need for specific fault tolerance features. We present an in-depth discussion of the suitability of different parallelism types (model, data, and hybrid); the need (or lack thereof) for checkpointing of critical data structures; and, most importantly, several fault tolerance proposals in MPI (user-level fault mitigation (ULFM) and Reinit) and their applicability to fault tolerant DL implementations. We leverage a distributed memory implementation of Caffe, currently available in the Machine Learning Toolkit for Extreme Scale (MaTEx), and extend MaTEx-Caffe to use a ULFM-based implementation. Our evaluation using the ImageNet dataset and the AlexNet and GoogLeNet neural network topologies demonstrates the effectiveness of the proposed fault tolerant DL implementation using Open MPI-based ULFM.
{"title":"Practical, linear-time, fully distributed algorithms for irregular gather and scatter","authors":"J. Träff","doi":"10.1145/3127024.3127025","DOIUrl":"https://doi.org/10.1145/3127024.3127025","url":null,"abstract":"We present new, simple, fully distributed, practical algorithms with linear time communication cost for irregular gather and scatter operations in which processors contribute or consume possibly different amounts of data. In a homogeneous, linear cost transmission model with start-up latency α and cost per unit β, the new algorithms take time 3⌈log2p⌉α + β Σi≠r mi where p is the number of processors, mi the amount of data for processor i, 0 ≤ i < p, and processor r, 0 ≤ r < p a root processor determined by the algorithm. With a fixed, externally given root processor r, there is an additive time penalty of at most β(Md' − mrd' − Σ0≤j<d' Mj) for some d' < ⌈log2 p⌉, where each Mj is the total amount of data in a tree of 2j different processors with roots rj as constructed by the algorithm. The worst-case time penalty is less than β Σi≠r mi. The algorithms have attractive properties for implementing the operations for MPI (the Message-Passing Interface). Standard algorithms using fixed trees take time either ⌈log2 p⌉(α + β Σi≠r mi) in the worst case, or (p − 1)α + Σi≠r βmi. We have used the new algorithms to give prototype implementations for the MPI_Gatherv and MPI_Scatterv collectives of MPI, and present benchmark results from a small and a medium-large InfiniBand cluster. In order to structure the experimental evaluation we formulate new performance guidelines for irregular collectives that can be used to assess the performance in relation to the corresponding regular collectives. We show that the new algorithms can fulfill these performance expectations within a large margin, and that standard implementations do not.","PeriodicalId":118516,"journal":{"name":"Proceedings of the 24th European MPI Users' Group Meeting","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124289542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}