{"title":"A hierarchical model to manage hardware topology in MPI applications","authors":"E. Jeannot, Farouk Mansouri, Guillaume Mercier","doi":"10.1145/3127024.3127030","DOIUrl":"https://doi.org/10.1145/3127024.3127030","url":null,"abstract":"The MPI standard is a major contribution in the landscape of parallel programming. Since its inception in the mid 90's it has ensured portability and performance for parallel applications on a wide spectrum of machines and architectures. With the advent of multicore machines, understanding and taking into account the underlying physical topology and memory hierarchy have become of paramount importance. The MPI standard in its current state, however, and despite recent evolutions is still unable to offer mechanisms to achieve this. In this paper, we detail several additions to the standard that give the user tools to address the hardware topology and data locality issues while improving application performance.","PeriodicalId":118516,"journal":{"name":"Proceedings of the 24th European MPI Users' Group Meeting","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125907972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Offloaded MPI persistent collectives using persistent generalized request interface","authors":"M. Hatanaka, Masamichi Takagi, A. Hori, Y. Ishikawa","doi":"10.1145/3127024.3127029","DOIUrl":"https://doi.org/10.1145/3127024.3127029","url":null,"abstract":"This paper proposes a library with a persistent generalized request interface for the implementation of persistent communication operations. This interface allows developers to add persistent communication functions to the existing MPI library. We implemented a new generalized request interface which supports persistent communications because the generalized requests of the MPI standard lacks the features needed for persistent communications. We evaluate the expressiveness of the interface by developing five implementations of a persistent collective operation, namely, MPI_Neighbor_-alltoall_init: one utilizes the collective offload capability of Fujitsu FX100 Tofu2 interconnect and other four utilize the standard MPI functions and the Fujitsu-extended MPI functions. These implementations are evaluated on FX100 with micro-benchmark programs measuring latency. The results show that the offloaded version outperforms the existing implementations by more than a factor of two with data sizes up to 16 KiB, confirming that the proposed library interface facilitates the development of persistent collectives and the offloaded implementation exhibits the expected performance.","PeriodicalId":118516,"journal":{"name":"Proceedings of the 24th European MPI Users' Group Meeting","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115197176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Notified access in coarray fortran","authors":"A. Fanfarillo, D. D. Vento","doi":"10.1145/3127024.3127026","DOIUrl":"https://doi.org/10.1145/3127024.3127026","url":null,"abstract":"With the increasing availability of the Remote Direct Memory Access (RDMA) support in computer networks, the so called Partitioned Global Address Space (PGAS) model has evolved in the last few years. Although there are several cases where a PGAS approach can easily solve difficult message passing situations, like in particle tracking and adaptive mesh refinement applications, the producer-consumer pattern, usually adopted in task-based parallelism, can only be implemented inefficiently because of the separation between data transfer and synchronization (which is usually unified in message passing programming models). In this paper, we provide two contributions: 1) we propose an extension for the Fortran language that provides the concept of Notified Access by associating regular coarray variables with event variables. 2) We demonstrate that the MPI extension proposed by foMPI for Notified Access can be used effectively to implement the same concept in a PGAS run-time library like OpenCoarrays.","PeriodicalId":118516,"journal":{"name":"Proceedings of the 24th European MPI Users' Group Meeting","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115393539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enabling hierarchy-aware MPI collectives in dynamically changing topologies","authors":"Simon Pickartz, Carsten Clauss, Stefan Lankes, A. Monti","doi":"10.1145/3127024.3127031","DOIUrl":"https://doi.org/10.1145/3127024.3127031","url":null,"abstract":"Hierarchy-awareness for message-passing has been around since the early 2000s with the emergence of SMP systems. Since then, many works dealt with the optimization of collective communication operations (so-called collectives) for such hierarchical topologies. However, until now, all these optimizations basically assume that the hierarchical topology remains static in a parallel program. In contrast, this paper strives for a discussion of how dynamically changing topologies can be considered during runtime, especially with focus on collective communication patterns. In doing so, the discussion starter for this is the possibility of process migration, e. g., in virtualized environments where the MPI processes are encapsulated within virtual machines. Consequently, processes originally located on distinct nodes can then (dynamically) become neighbors on the same SMP node. The central subject for the discussion on how such changes can be taken into account for optimized collectives is a new experimental MPI function that we propose and detail within this paper.","PeriodicalId":118516,"journal":{"name":"Proceedings of the 24th European MPI Users' Group Meeting","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131722397","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the 24th European MPI Users' Group Meeting","authors":"Antonio J. Peña, P. Balaji, W. Gropp, R. Thakur","doi":"10.1145/3127024","DOIUrl":"https://doi.org/10.1145/3127024","url":null,"abstract":"","PeriodicalId":118516,"journal":{"name":"Proceedings of the 24th European MPI Users' Group Meeting","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133021699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PMIx: process management for exascale environments","authors":"R. Castain, David G. Solt, Joshua Hursey, Aurélien Bouteiller","doi":"10.1145/3127024.3127027","DOIUrl":"https://doi.org/10.1145/3127024.3127027","url":null,"abstract":"High-Performance Computing (HPC) applications have historically executed in static resource allocations, using programming models that ran independently from the resident system management stack (SMS). Achieving exascale performance that is both cost-effective and fits within site-level environmental constraints will, however, require that the application and SMS collaboratively orchestrate the flow of work to optimize resource utilization and compensate for on-the-fly faults. The Process Management Interface - Exascale (PMIx) community is committed to establishing scalable workflow orchestration by defining an abstract set of interfaces by which not only applications and tools can interact with the resident SMS, but also the various SMS components can interact with each other. This paper presents a high-level overview of the goals and current state of the PMIx standard, and lays out a roadmap for future directions.","PeriodicalId":118516,"journal":{"name":"Proceedings of the 24th European MPI Users' Group Meeting","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124207114","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Characterizing MPI matching via trace-based simulation","authors":"Kurt B. Ferreira, Scott Levy, K. Pedretti, Ryan E. Grant","doi":"10.1145/3127024.3127040","DOIUrl":"https://doi.org/10.1145/3127024.3127040","url":null,"abstract":"With the increased scale expected on future leadership-class systems, detailed information about the resource usage and performance of MPI message matching provides important insights into how to maintain application performance on next-generation systems. However, obtaining MPI message matching performance data is often not possible without significant effort. A common approach is to instrument an MPI implementation to collect relevant statistics. While this approach can provide important data, collecting matching data at runtime perturbs the application's execution, including its matching performance, and is highly dependent on the MPI library's matchlist implementation. In this paper, we introduce a trace-based simulation approach to obtain detailed MPI message matching performance data for MPI applications without perturbing their execution. Using a number of key parallel workloads, we demonstrate that this simulator approach can rapidly and accurately characterize matching behavior. Specifically, we use our simulator to collect several important statistics about the operation of the MPI posted and unexpected queues. For example, we present data about search lengths and the duration that messages spend in the queues waiting to be matched. Data gathered using this simulation-based approach have significant potential to aid hardware designers in determining resource allocation for MPI matching functions and provide application and middleware developers with insight into the scalability issues associated with MPI message matching.","PeriodicalId":118516,"journal":{"name":"Proceedings of the 24th European MPI Users' Group Meeting","volume":"136 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123177156","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhanced memory management for scalable MPI intra-node communication on many-core processor","authors":"Joong-Yeon Cho, Hyun-Wook Jin, Dukyun Nam","doi":"10.1145/3127024.3127035","DOIUrl":"https://doi.org/10.1145/3127024.3127035","url":null,"abstract":"As the number of cores installed in a single computing node drastically increases, the intra-node communication between parallel processes becomes more important. The parallel programming models, such as Message Passing Interface (MPI), internally perform memory-intensive operations for intra-node communication. Thus, to address the scalability issue on many-core processors, it is critical to exploit emerging memory features provided by the contemporary computer systems. For example, the latest many-core processors are equipped with a high-bandwidth on-package memory Modern 64-bit processors also support a large page size (e.g., 2MB), which can significantly reduce the number of TLB misses. The on-package memory and the huge pages have considerable potential for improving the performance of intra-node communication. However, such features are not thoroughly investigated in terms of intra-node communication in the literature. In this paper, we propose enhanced memory management schemes to efficiently utilize the on-package memory and provide support for huge pages. The proposed schemes can significantly reduce the data copy and memory mapping overheads in MPI intra-node communication. Our experimental results show that our implementation on MVAPICH2 can improve the bandwidth of point-to-point communication up to 373%, and can reduce the latency of collective communication by 79% on an Intel Xeon Phi Knights Landing (KNL) processor.","PeriodicalId":118516,"journal":{"name":"Proceedings of the 24th European MPI Users' Group Meeting","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129545075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Planning for performance: persistent collective operations for MPI","authors":"B. Morgan, Daniel J. Holmes, A. Skjellum, P. Bangalore, Srinivas Sridharan","doi":"10.1145/3127024.3127028","DOIUrl":"https://doi.org/10.1145/3127024.3127028","url":null,"abstract":"Advantages of nonblocking collective communication in MPI have been established over the past quarter century, even predating MPI-1. For regular computations with fixed communication patterns, more optimizations can be revealed through the use of persistence (planned transfers) not currently available in the MPI-3 API except for a limited form of point-to-point persistence (aka half-channels) standardized since MPI-1. This paper covers the design, prototype implementation of LibPNBC (based on LibNBC), and MPI-4 standardization status of persistent nonblocking collective operations. We provide early performance results, using a modified version of NBCBench and an example illustrating the potential performance enhancements for such operations. Persistent operations allow MPI implementations to make intelligent choices about algorithm and resource utilization once and amortize this decision cost across many uses in a long-running program. Evidence that this approach is of value is provided. As with non-persistent, nonblocking collective operations, the requirement for strong progress and blocking completion notification are jointly needed to maximize the benefit of such operations (e.g., overlap of communication with computation or other communication). Further enhancement of the current implementation prototype as well as additional opportunities to enhance performance through the application of these new APIs comprise future work.","PeriodicalId":118516,"journal":{"name":"Proceedings of the 24th European MPI Users' Group Meeting","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122393700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Transforming blocking MPI collectives to Non-blocking and persistent operations","authors":"H. Ahmed, A. Skjellum, P. Bangalore, P. Pirkelbauer","doi":"10.1145/3127024.3127033","DOIUrl":"https://doi.org/10.1145/3127024.3127033","url":null,"abstract":"This paper describes Petal, a prototype tool that uses compiler-analysis techniques to automate code transformations to hide communication costs behind computation by replacing blocking MPI functions with corresponding nonblocking and persistent collective operations while maintaining legacy applications' correctness. In earlier work, we have already demonstrated Petal's ability to transform point-to-point MPI operations in complement to the results shown here. The contributions of this paper include the approach to collective operation transformations, a description of achieved functionality, examples of transformations, and demonstration of performance improvements obtained thus far on representative sample MPI programs. Depending on system scale and problem size, the transformations yield a speedup of up to a factor of two. This tool can be used to transform useful classes of new and legacy MPI programs to use the newest variants of MPI functions to improve performance without manual intervention for forthcoming HPC systems and updated versions of the MPI standard.","PeriodicalId":118516,"journal":{"name":"Proceedings of the 24th European MPI Users' Group Meeting","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127564166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}