Title: MPI usage at NERSC: Present and Future
Authors: A. Koniges, B. Cook, J. Deslippe, T. Kurth, H. Shan
DOI: 10.1145/2966884.2966894
Abstract: In this poster, we describe how MPI is used at the National Energy Research Scientific Computing Center (NERSC). NERSC is the production high-performance computing center for the US Department of Energy, with more than 5000 users and 800 distinct projects. Through a variety of tools (e.g., a user survey and application team collaborations), we determine how MPI is used on our latest systems, with a particular focus on advanced features and on how early applications intend to use MPI on NERSC's upcoming Intel Knights Landing (KNL) many-core system, one of the first to be deployed. In the poster, we also compare the usage of MPI to exascale developmental programming models such as UPC++ and HPX, with an eye on what features and extensions to MPI are plausible and useful for NERSC users. We also discuss perceived shortcomings of MPI, and why certain groups use other parallel programming models on the systems. In addition to a broad survey of the NERSC HPC population, we follow the evolution of a few key application codes that are being highly optimized for the KNL architecture using advanced OpenMP techniques. We study how these highly optimized on-node proxy apps and full applications start to make the transition to full hybrid MPI+OpenMP implementations on the self-hosted KNL system.

Title: Introducing Task-Containers as an Alternative to Runtime-Stacking
Authors: Jean-Baptiste Besnard, Julien Adam, S. Shende, Marc Pérache, Patrick Carribault, Julien Jaeger
DOI: 10.1145/2966884.2966910
Abstract: The advent of many-core architectures poses new challenges to the MPI programming model, which was designed for distributed-memory message passing. It is now clear that MPI will have to evolve in order to exploit shared-memory parallelism, either by collaborating with other programming models (MPI+X) or by introducing new shared-memory approaches. This paper considers extensions to C and C++ that make it possible for MPI processes to run inside threads. More generally, a thread-local storage (TLS) library is developed to simplify the collocation of arbitrary tasks and services in a shared-memory context called a task-container. The paper discusses how such containers simplify model and service mixing at the OS-process level, eventually easing the collocation of arbitrary tasks with MPI processes in a runtime-agnostic fashion and opening alternatives to runtime stacking.

{"title":"Performance comparison of Eulerian kinetic Vlasov code between flat-MPI parallelism and hybrid parallelism on Fujitsu FX100 supercomputer","authors":"T. Umeda, K. Fukazawa","doi":"10.1145/2966884.2966891","DOIUrl":"https://doi.org/10.1145/2966884.2966891","url":null,"abstract":"The present study deals with the Vlasov simulation code, which solves the first-principle kinetic equations called the Vlasov equation for space plasma. In the present study, a five-dimensional Vlasov code with two spatial dimension and three velocity dimensions is parallelized with two methods, the flat-MPI and the MPI-OpenMP hybrid parallelism. The two types of the parallel Vlasov code are benchmarked on massively-parallel supercomputer Fujitsu FX100, which has been developed with the second-generation post architecture of the K computer in Japan. In the present performance comparison, we vary the number of threads per nodes from 1 (the flat-MPI parallelism) to 32. The result shows that the OpenMP-MPI hybrid parallelism outperforms the flat-MPI for any number of compute nodes. There is an optimum number of threads per nodes depending on the number of compute nodes. It is shown that the optimum number of threads per node becomes larger on a larger number of compute nodes. This is because the communication time of an MPI collective communication subroutine, which is used for convergence check of iterative methods, can be reduced by decreasing the total number of processes with the OpenMP-MPI hybrid parallelism.","PeriodicalId":264069,"journal":{"name":"Proceedings of the 23rd European MPI Users' Group Meeting","volume":"35 5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126333810","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Space Performance Tradeoffs in Compressing MPI Group Data Structures","authors":"Sameer Kumar, P. Heidelberger, C. Stunkel","doi":"10.1145/2966884.2966911","DOIUrl":"https://doi.org/10.1145/2966884.2966911","url":null,"abstract":"MPI is a popular programming paradigm on parallel machines today. MPI libraries sometimes use O(N) data structures to implement MPI functionality. The IBM Blue Gene/Q machine has 16 GB memory per node. If each node runs 32 MPI processes, only 512 MB is available per process, requiring the MPI library to be space efficient. This scenario will become severe in a future Exascale machine with tens of millions of cores and MPI endpoints. We explore techniques to compress the dense O(N) mapping data structures that map the logical process ID to the global rank. Our techniques minimize topological communicator mapping state by replacing table lookups with a mapping function. We also explore caching schemes with performance results to optimize overheads of the mapping functions for recent translations in multiple MPI micro-benchmarks, and the 3D FFT and Algebraic Multi Grid application benchmarks.","PeriodicalId":264069,"journal":{"name":"Proceedings of the 23rd European MPI Users' Group Meeting","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129727665","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Infrastructure and API Extensions for Elastic Execution of MPI Applications
Authors: Isaías A. Comprés Ureña, Ao Mo-Hellenbrand, M. Gerndt, H. Bungartz
DOI: 10.1145/2966884.2966917
Abstract: Dynamic process support was added to MPI in version 2.0 of the standard. This feature has not been widely used by application developers, in part due to the performance cost and limitations of the spawn operation. In this paper, we propose an extension to MPI that consists of four new operations. These operations allow an application to be initialized in an elastic mode of execution and to enter an adaptation window when necessary, during which resources are incorporated into or released from the application's world communicator. A prototype solution based on the MPICH library and the SLURM resource manager is presented and evaluated alongside an elastic scientific application that makes use of the new MPI extensions. The cost of the new operations is shown to be negligible, mainly due to their latency-hiding design, leaving the application's data redistribution time as the only significant performance cost.

{"title":"Distributed Memory Implementation Strategies for the kinetic Monte Carlo Algorithm","authors":"António Esteves, Alfredo Moura","doi":"10.1145/2966884.2966908","DOIUrl":"https://doi.org/10.1145/2966884.2966908","url":null,"abstract":"This paper presents strategies to parallelize a previously implemented kinetic Monte Carlo (kMC) algorithm. The process under simulation is the precipitation in an aluminum scandium alloy. The selected parallel algorithm is called synchronous parallel kinetic Monte Carlo (spkMC). spkMC was implemented with a distributed memory architecture and using the Message Passing Interface (MPI) communication protocol. In spkMC the different processes synchronize at regular points, called end of sprint. During a sprint there is no interaction among processes. A checker board scheme was adopted to avoid possible conflicts among processes during each sprint. To optimize performance different implementations were explored, each one with a different computation vs. communication strategy. The obtained results prove that a rigorous distributed and parallel implementation reproduces accurately the statistical behavior observed with the sequential kMC. Results also prove that simulation time can be reduced with a distributed parallelization but, due to the non-deterministic nature of kMC, significant and scalable gains in parallelization oblige to introduce some simplifications and approximations.","PeriodicalId":264069,"journal":{"name":"Proceedings of the 23rd European MPI Users' Group Meeting","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133176768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: How I Learned to Stop Worrying and Love In Situ Analytics: Leveraging Latent Synchronization in MPI Collective Algorithms
Authors: Scott Levy, Kurt B. Ferreira, Patrick M. Widener, P. Bridges, Oscar H. Mondragon
DOI: 10.1145/2966884.2966920
Abstract: Scientific workloads running on current extreme-scale systems routinely generate tremendous volumes of data for postprocessing. This data movement has become a serious issue due to its energy cost and the fact that I/O bandwidths have not kept pace with data generation rates. In situ analytics is an increasingly popular alternative in which post-simulation processing is embedded into an application, running as part of the same MPI job. This can reduce data movement costs but introduces a new potential source of interference for the application. Using a validated simulation-based approach, we investigate how best to mitigate the interference from time-shared in situ tasks for a number of key extreme-scale workloads. This paper makes several contributions. First, we show that the independent scheduling of in situ analytics tasks can significantly degrade application performance, with slowdowns exceeding 1000%. Second, we demonstrate that the degree of synchronization found in many modern collective algorithms is sufficient to reduce the overhead of this interference to less than 10% in most cases. Finally, we show that many applications already frequently invoke collective operations that use these synchronizing MPI algorithms; the synchronization introduced by these algorithms can therefore be leveraged to efficiently schedule analytics tasks with minimal changes to existing applications. This paper provides critical analysis and guidance for MPI users and developers on the importance of scheduling in situ analytics tasks. It shows the degree of synchronization needed to mitigate the performance impacts of these time-shared coupled codes and demonstrates how that synchronization can be realized in an extreme-scale environment using modern collective algorithms.

{"title":"A Library for Advanced Datatype Programming","authors":"J. Träff","doi":"10.1145/2966884.2966904","DOIUrl":"https://doi.org/10.1145/2966884.2966904","url":null,"abstract":"We present a library providing functionality beyond the MPI standard for manipulating application data layouts described by MPI derived datatypes. The main contributions are: a) Constructors for several, new datatypes for describing application relevant data layouts. b) A set of extent-free constructors that eliminate the need for type resizing. c) New navigation and query functionality for accessing individual data elements in layouts described by datatypes, and for comparing layouts. d) Representation of datatype signatures by explicit, associated signature types, as well as functionality for explicit generation of type maps. As a simple application, we implement reduction collectives on noncontiguous, but homogeneous derived datatypes. Some of the proposed functionality could be implemented more efficiently within an MPI library.","PeriodicalId":264069,"journal":{"name":"Proceedings of the 23rd European MPI Users' Group Meeting","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130906533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Potential of Diffusive Load Balancing at Large Scale","authors":"Matthias Lieber, Kerstin Gößner, W. Nagel","doi":"10.1145/2966884.2966887","DOIUrl":"https://doi.org/10.1145/2966884.2966887","url":null,"abstract":"Dynamic load balancing with diffusive methods is known to provide minimal load transfer and requires communication between neighbor nodes only. These are very attractive properties for highly parallel systems. We compare diffusive methods with state-of-the-art geometrical and graph-based partitioning methods on thousands of nodes. When load balancing overheads, i.e. repartitioning computation time and migration, have to be minimized, diffusive methods provide substantial benefits.","PeriodicalId":264069,"journal":{"name":"Proceedings of the 23rd European MPI Users' Group Meeting","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132098519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Proceedings of the 23rd European MPI Users' Group Meeting
Authors: J. Dongarra, Daniel Holmes, A. Collis, J. Träff, Lorna Smith
DOI: 10.1145/2966884
Abstract: EuroMPI is the preeminent meeting for users, developers and researchers to interact and discuss new developments and applications of message-passing parallel computing, in particular in and related to the Message Passing Interface (MPI). The annual meeting has a long, rich tradition, and has been held in Madrid (2013), Vienna (2012), Santorini (2011), Stuttgart (2010), Espoo (2009), Dublin (2008), Paris (2007), Bonn (2006), Sorrento (2005), Budapest (2004), Venice (2003), Linz (2002), Santorini (2001), Balatonfüred (2000), Barcelona (1999), Liverpool (1998), Cracow (1997), Munich (1996), Lyon (1995), and Rome (1994).