{"title":"Proceedings of the 25th European MPI Users' Group Meeting","authors":"","doi":"10.1145/3236367","DOIUrl":"https://doi.org/10.1145/3236367","url":null,"abstract":"","PeriodicalId":225539,"journal":{"name":"Proceedings of the 25th European MPI Users' Group Meeting","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131390870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Transparent High-Speed Network Checkpoint/Restart in MPI
Julien Adam, Jean-Baptiste Besnard, A. Malony, S. Shende, Marc Pérache, Patrick Carribault, Julien Jaeger
DOI: https://doi.org/10.1145/3236367.3236383

Abstract: Fault tolerance has always been an important topic when it comes to running massively parallel programs at scale. Statistically, hardware and software failures are expected to occur more often on systems gathering millions of computing units. Moreover, the larger the job, the more computing hours are wasted by a crash. In this paper, we describe the work done in our MPI runtime to enable a transparent checkpointing mechanism. Unlike the MPI 4.0 User-Level Failure Mitigation (ULFM) interface, our work targets solely Checkpoint/Restart (C/R) and ignores wider features such as resiliency. We show how existing transparent checkpointing methods can be practically applied to MPI implementations given sufficient collaboration from the MPI runtime. Our C/R technique is then measured on MPI benchmarks such as IMB and LULESH over an InfiniBand high-speed network, demonstrating that the chosen approach is sufficiently general and that performance is mostly preserved. We argue that enabling fault tolerance without any modification to target MPI applications is possible, and show how it could be the first step toward more integrated resiliency combined with failure mitigation such as ULFM.
MPI Stages: Checkpointing MPI State for Bulk Synchronous Applications
Nawrin Sultana, A. Skjellum, I. Laguna, M. Farmer, K. Mohror, M. Emani
DOI: https://doi.org/10.1145/3236367.3236385

Abstract: When an MPI program experiences a failure, the most common recovery approach is to restart all processes from a previous checkpoint and to re-queue the entire job. A disadvantage of this method is that, although the failure occurred within the main application loop, live processes must start again from the beginning of the program, along with new replacement processes, which incurs unnecessary overhead for the live processes. To avoid such overheads and concomitant delays, we introduce the concept of "MPI Stages." MPI Stages saves internal MPI state in a separate checkpoint in conjunction with application state. Upon failure, both MPI and application state are recovered from their last synchronous checkpoints, and execution continues without restarting the overall MPI job. Live processes roll back only a few iterations within the main loop instead of rolling back to the beginning of the program, while a replacement for the failed process restarts and reintegrates, thereby achieving faster failure recovery. This approach integrates well with large-scale, bulk synchronous applications and checkpoint/restart. In this paper, we identify requirements for production MPI implementations to support state checkpointing with MPI Stages, which include capturing and managing internal MPI state and serializing and deserializing user handles to MPI objects. We evaluate our fault tolerance approach with a proof-of-concept prototype MPI implementation that includes MPI Stages. We demonstrate its functionality and performance using LULESH and microbenchmarks. Our results show that MPI Stages reduces the recovery time by 13× for LULESH in comparison to checkpoint/restart.
{"title":"Energy-efficient localised rollback via data flow analysis and frequency scaling","authors":"K. Dichev, K. Cameron, Dimitrios S. Nikolopoulos","doi":"10.1145/3236367.3236379","DOIUrl":"https://doi.org/10.1145/3236367.3236379","url":null,"abstract":"Exascale systems will suffer failures hourly. HPC programmers rely mostly on application-level checkpoint and a global rollback to recover. In recent years, techniques reducing the number of rolling back processes have been implemented via message logging. However, the log-based approaches have weaknesses, such as being dependent on complex modifications within an MPI implementation, and the fact that a full restart may be required in the general case. To address the limitations of all log-based mechanisms, we return to checkpoint-only mechanisms, but advocate data flow rollback (DFR), a fundamentally different approach relying on analysis of the data flow of iterative codes, and the well-known concept of data flow graphs. We demonstrate the benefits of DFR for an MPI stencil code by localising rollback, and then reduce energy consumption by 10-12% on idling nodes via frequency scaling. We also provide large-scale estimates for the energy savings of DFR compared to global rollback, which for stencil codes increase as n2 for a process count n.","PeriodicalId":225539,"journal":{"name":"Proceedings of the 25th European MPI Users' Group Meeting","volume":"54 52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121142024","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Full-Duplex Inter-Group All-to-All Broadcast Algorithms with Optimal Bandwidth
Qiao Kang, J. Träff, Reda Al-Bahrani, Ankit Agrawal, A. Choudhary, W. Liao
DOI: https://doi.org/10.1145/3236367.3236374

Abstract: MPI inter-group collective communication patterns can be viewed as bipartite graphs that divide processes into two disjoint groups, where messages are transferred between, but not within, the groups. Such communication patterns can serve as basic operations for scientific application workflows. In this paper, we present parallel algorithms for inter-group all-to-all broadcast (Allgather) communication with optimal bandwidth for any message size and process count under single-port communication constraints. We implement the algorithms using MPI point-to-point and intra-group collective communication functions and evaluate their performance on the Cori supercomputer at NERSC. Using message sizes ranging from 256 B to 64 MB, the experiments show a significant performance improvement achieved by our algorithm, which is up to 9.27 times faster than production MPI libraries that adopt the so-called root-gathering algorithm.
Enabling callback-driven runtime introspection via MPI_T
Marc-André Hermanns, N. Hjelm, Michael Knobloch, K. Mohror, M. Schulz
DOI: https://doi.org/10.1145/3236367.3236370

Abstract: Understanding the behavior of parallel applications that use the Message Passing Interface (MPI) is critical for optimizing communication performance. Performance tools for MPI currently rely on the PMPI profiling interface or the MPI Tools Information Interface, MPI_T, for portably collecting information for performance measurement and analysis. While tools using these interfaces have proven to be extremely valuable for performance tuning, these interfaces only provide synchronous information, i.e., when an MPI or an MPI_T function is called. There is currently no option for collecting information about asynchronous events from within the MPI library. In this work, we propose a callback-driven interface for event notification from MPI implementations. Our approach is integrated into the existing MPI_T interface and provides a portable API for tools to discover and register for events of interest. We demonstrate the functionality and usability of the interface with a prototype implementation in Open MPI, a small logging tool (MEL), and the measurement infrastructure Score-P.
{"title":"Using Node Information to Implement MPI Cartesian Topologies","authors":"W. Gropp","doi":"10.1145/3236367.3236377","DOIUrl":"https://doi.org/10.1145/3236367.3236377","url":null,"abstract":"The MPI API provides support for Cartesian process topologies, including the option to reorder the processes to achieve better communication performance. But MPI implementations rarely provide anything useful for the reorder option, typically ignoring it. One argument made is that modern interconnects are fast enough that applications are less sensitive to the exact layout of processes onto the system. However, intranode communication performance is much greater than internode communication performance. In this paper, we show a simple approach that takes into account only information about which MPI processes are on the same node to provide a fast and effective implementation of the MPI Cartesian topology. While not optimal, this approach provides a significant improvement over all tested MPI implementations and provides an implementation that may be used as the default in any MPI implementation of MPI_Cart_create.","PeriodicalId":225539,"journal":{"name":"Proceedings of the 25th European MPI Users' Group Meeting","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116932645","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MC-CChecker: A Clock-Based Approach to Detect Memory Consistency Errors in MPI One-Sided Applications","authors":"Thanh-Dang Diep, K. Fürlinger, N. Thoai","doi":"10.1145/3236367.3236369","DOIUrl":"https://doi.org/10.1145/3236367.3236369","url":null,"abstract":"MPI one-sided communication decouples data movement from synchronization, which eliminates overhead from unneeded synchronization and allows for greater concurrency. On the one hand this fact is the great advantage of MPI one-sided communication, but on the other, it poses enormous challenges for programmers in preserving the reliability of programs. Memory consistency errors are notorious for degrading reliability as well as performance of MPI one-sided applications. Even an MPI expert can easily make these mistakes. The lockopts bug occurred in an RMA test case that is part of MPICH MPI implementation is an example for this situation. Hence, detecting memory consistency errors is extremely challenging. MC-Checker is the most cutting-edge debugger to address these errors effectively. MC-Checker tackles the memory consistency errors based on the happened-before relation. Taking full advantage of the relation makes DN-Analyzer of MC-Checker difficult to scale well. For that reason, MC-Checker does ignore the transitive ordering of the happened-before relation to retain scalability of DN-Analyzer. Consequently, MC-Checker is highly able to impose a potential source of false positives. In order to overcome this issue, we present a novel clock-based approach called MC-CChecker with the aim of fully preserving the happened-before relation by making use of an encoded vector clock. MC-CChecker inherits distinguishing features from MC-Checker by reusing ST-Analyzer and Profiler while focusing mainly on the optimization of DN-Analyzer. The experimental findings prove that MC-CChecker not only effectively detects memory consistency errors as MC-Checker did, but also completely eliminates the potential source of false positives which is a major limitation of MC-Checker while still retaining acceptable overheads of execution time and memory usage for DN-Analyzer. Especially, DN-Analyzer of MC-CChecker is fairly scalable when processing a large amount of trace files generated from running the lockopts up to 8192 processes.","PeriodicalId":225539,"journal":{"name":"Proceedings of the 25th European MPI Users' Group Meeting","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130128802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MPI+OpenMP Tasking Scalability for the Simulation of the Human Brain: Human Brain Project
Pedro Valero-Lara, R. Sirvent, Antonio J. Peña, X. Martorell, Jesús Labarta
DOI: https://doi.org/10.1145/3236367.3236373

Abstract: The simulation of the behavior of the human brain is one of the most ambitious challenges today, with no shortage of important applications. Many initiatives in the USA, Europe, and Japan attempt to achieve this challenging target. In this work we focus on the most important European initiative, the Human Brain Project, and on one of its tools, Arbor. This tool simulates the spikes triggered in a neuronal network by computing the voltage along the neurons' morphology, making it one of the most precise simulators available today. In the present work, we evaluate the use of MPI+OpenMP tasking on top of the Arbor simulator. We present the main characteristics of the Arbor tool and show how they can be efficiently managed using MPI+OpenMP tasking. We prove that this approach achieves good scaling even when computing a relatively low workload (number of neurons) per node, using up to 32 nodes. Our goal is not only a highly scalable MPI-based implementation, but also a tool with a high degree of abstraction that, through MPI+OpenMP tasking, does not sacrifice control or performance.
{"title":"Multi-Threading and Lock-Free MPI RMA Based Graph Processing on KNL and POWER Architectures","authors":"Mingzhe Li, Xiaoyi Lu, H. Subramoni, D. Panda","doi":"10.1145/3236367.3236371","DOIUrl":"https://doi.org/10.1145/3236367.3236371","url":null,"abstract":"Intel Knights Landing (KNL) and IBM POWER architectures are becoming widely deployed on modern supercomputing systems due to its powerful components. MPI Remote Memory Access (RMA) model that provides one-sided communication semantics has been seen as an attractive approach for developing High-Performance Data Analytics (HPDA) applications such as graph processing with irregular communication characteristics. To take advantage of a large number of hardware threads offered by KNL and POWER, HPDA applications and MPI RMA runtime need to be re-designed to get optimal performance. In this paper, we propose multi-threading and lock-free designs in the MPI runtime as well as Graph500 application on KNL and POWER architectures. At the micro-bench level, our proposed runtime-level designs are able to reduce the latency of uni-directional MPI_Put and MPI_Get by up to 3X compared to IntelMPI and Spectrum MPI. At the application level, with 1,024 processes on 32 KNL nodes, our proposed design could outperform IntelMPI library by 32%. With 512 processes on eight POWER nodes, our proposed design could outperform Spectrum MPI library by 19%. To the best of our knowledge, this is the first paper to design and evaluate MPI RMA-based graph processing applications on KNL and POWER architectures.","PeriodicalId":225539,"journal":{"name":"Proceedings of the 25th European MPI Users' Group Meeting","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114339644","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}