Higher-level parallelization for local and distributed asynchronous task-based programming
Hartmut Kaiser, T. Heller, Daniel Bourgeois, D. Fey
ESPM '15, 2015-11-15. DOI: 10.1145/2832241.2832244

Abstract: One of the biggest challenges on the way to exascale computing is programmability in the context of performance portability. Efficiently utilizing the prospective architectures of exascale supercomputers will be challenging in many ways, not least because of a massive increase in on-node parallelism and the growing complexity of memory hierarchies. Parallel programming models need to be able to formulate algorithms that exploit these architectural peculiarities. The recent revival of interest in C++ across industry and the wider community has spurred a remarkable number of standardization proposals and technical specifications. Among these efforts is the development of facilities for seamlessly integrating various types of parallelism, such as iterative parallel execution, task-based parallelism, asynchronous execution flows, continuation-style computation, and explicit fork-join control flow of independent and non-homogeneous code paths. These proposals form the foundation of a powerful high-level abstraction that allows C++ codes to deal with the ever-increasing architectural complexity of recent hardware.

In this paper, we present the results of developing these higher-level parallelization facilities in HPX, a general-purpose C++ runtime system for applications of any scale. The APIs have been designed to overcome the limitations of the programming models prevalently used in C++ codes today. HPX exposes a uniform higher-level API that gives the application programmer syntactic and semantic equivalence across various types of on-node and off-node parallelism, all of which are well integrated into the C++ type system. We show that these higher-level facilities, which are fully aligned with modern C++ programming concepts, are easily extensible, fully generic, and enable highly efficient parallelization on par with or better than existing equivalent applications based on OpenMP and/or MPI.

Fault tolerance features of a new multi-SPMD programming/execution environment
Miwako Tsuji, S. Petiton, M. Sato
ESPM '15, 2015-11-15. DOI: 10.1145/2832241.2832243

Abstract: Supercomputers in the exascale era will consist of a huge number of nodes arranged in a multi-level hierarchy. Exploiting such systems raises many important challenges, including scalability, programmability, reliability, and energy efficiency. In previous work, we focused on scalability and programmability and proposed FP2C (Framework for Post-Petascale Computing), a development and execution environment for parallel programming based on workflow and PGAS (Partitioned Global Address Space) programming models. In this paper, we focus on reliability. We extend FP2C by adding a fault-detection capability to its middleware and by incorporating a fault-resilience scheduling policy into the workflow scheduler. With the extended FP2C, fault tolerance can be achieved without modifying applications.

PPL: an abstract runtime system for hybrid parallel programming
Alex Brooks, Hoang-Vu Dang, Nikoli Dryden, M. Snir
ESPM '15, 2015-11-15. DOI: 10.1145/2832241.2832246

Abstract: Hardware trends indicate that supercomputers will see fast-growing intra-node parallelism. Future programming models will need to carefully manage the interaction between inter- and intra-node parallelism to cope with this evolution. Many programming models expose both levels of parallelism, but they do not scale well as per-node thread counts rise, and the limited interoperability between threading and communication leads to unnecessary software overheads and an increased amount of unnecessary communication. To address this, it is necessary to understand the limitations of current models and to develop new approaches.

We propose a new runtime system design, PPL, which abstracts the important high-level concepts of a typical parallel system for distributed-memory machines. By modularizing these elements, individual layers can be tested to better understand the needs of future programming models. We present details of the design and implementation of PPL in C++11 and evaluate the performance of several different module implementations through micro-benchmarks and three applications: Barnes-Hut, Monte Carlo particle tracking, and a sparse triangular solver.

The scalable petascale data-driven approach for the Cholesky factorization with multiple GPUs
Yuki Tsujita, Toshio Endo, K. Fujisawa
ESPM '15, 2015-11-15. DOI: 10.1145/2832241.2832245

Abstract: The Cholesky factorization is an important linear algebra kernel used in solving semidefinite programming (SDP) problems. However, the large computational cost of the Cholesky factorization of the Schur complement matrix (SCM) has been an obstacle to solving large-scale problems. This paper describes a new version of the parallel SDP solver SDPARA, equipped with a Cholesky factorization implementation that demonstrated 1.7 PFlops performance on a problem with over two million constraints using 4,080 GPUs. Performance and scalability are further improved by introducing a data-driven approach in place of the traditional synchronous approach. We also point out that typical data-driven implementations are limited in scalability, and we demonstrate the efficiency of the proposed approach via experiments on the TSUBAME2.5 supercomputer.

ACPdl: data-structure and global memory allocator library over a thin PGAS-layer
Yuichiro Ajima, Takafumi Nose, K. Saga, Naoyuki Shida, S. Sumimoto
ESPM '15, 2015-11-15. DOI: 10.1145/2832241.2832242

Abstract: HPC systems comprise an increasing number of processor cores as the exascale computing era approaches. As the number of parallel processes on a system grows, the number of point-to-point connections per process grows with it, and the memory consumed by these connections becomes an issue. A new communication library called Advanced Communication Primitives (ACP) is being developed to address this issue by providing communication functions based on the Partitioned Global Address Space (PGAS) model, which is potentially connection-less. The ACP library is designed to underlie domain-specific languages and parallel language runtimes. The ACP basic layer (ACPbl) comprises a minimal set of functions that abstract interconnect devices and provide an address translation mechanism. With ACPbl alone, global addresses can be granted only to local memory. In this paper, a new set of functions called the ACP data library (ACPdl), comprising a global memory allocator and a data-structure library, is introduced to improve the productivity of the ACP library. The global memory allocator allocates a memory region on a remote process and assigns a global address to it without involving the remote process. The data-structure library uses the global memory allocator internally and provides functions to create, read, update, and delete distributed data structures. Evaluation results for the global memory allocator and the associative-array data-structure functions show that the overhead between the main and communication threads may become a bottleneck when an implementation of ACPbl uses a low-latency HPC-dedicated interconnect device.

Task characterization-driven scheduling of multiple applications in a task-based runtime
K. Chandrasekar, B. Seshasayee, Ada Gavrilovska, K. Schwan
ESPM '15, 2015-11-15. DOI: 10.1145/2832241.2832248

Abstract: Task-based runtimes like OCR, X10, and Charm++ promise to address scalability challenges on exascale machines thanks to their fine-grained parallelism, inherent asynchrony, and consequent efficient localized synchronization. Although such runtimes are typically used to run a single application at a time, a common HPC scenario involves running a producer simulation application co-located with a consumer analytics application to reduce data-movement costs. The potentially diverse requirements of such co-located applications challenge the runtime's ability to efficiently manage underlying resources, maintain application performance, and minimize the effects of sharing on application progress. To address this, we implement and study techniques based on application task characterization to improve resource utilization in shared task-based runtimes. We demonstrate that, by maintaining task characteristics such as compute, cache, or memory intensity, obtained through offline and/or online methods, we can improve a task-based runtime's ability to schedule and place tasks so as to minimize resource contention between co-running applications. Results are obtained via experimentation with the Open Community Runtime (OCR) on two distinct platforms: an x86-based machine and a research platform based on the experimental Traleika Glacier (TG) architecture. On x86 we see a performance improvement of 15%, and on TG we observe a reduction in energy usage of more than 50%, illustrating the potential benefits of the approach for next-generation exascale platforms.

Hyper-Q aware intranode MPI collectives on the GPU
Iman Faraji, A. Afsahi
ESPM '15, 2015-11-15. DOI: 10.1145/2832241.2832247

Abstract: In GPU clusters, high GPU utilization and efficient communication play an important role in the performance of MPI applications. To improve GPU utilization, NVIDIA has introduced the Multi-Process Service (MPS), eliminating the context-switching overhead among processes accessing the GPU and allowing multiple intranode processes to further overlap their CUDA tasks on the GPU and potentially share its resources through the Hyper-Q feature. Prior to MPS, Hyper-Q could only provide such resource sharing within a single process. In this paper, we evaluate the effect of the MPS service on GPU communication, focusing on CUDA IPC and host-staged copies. We provide evidence that the MPS service is beneficial when multiple interprocess communications use these copy types. However, we show that careful design decisions are required to further harness the potential of this service. To this end, we propose a Static and a Dynamic algorithm that can be applied to various intranode MPI collective operations, and as a test case we provide results for the MPI_Allreduce operation. Both approaches, while following different algorithms, use a combination of host-staged and CUDA IPC copies for the interprocess communication in their collective designs. By selecting the right number and types of copies, our algorithms efficiently leverage the MPS and Hyper-Q features and improve over MVAPICH2 and MVAPICH2-GDR for most medium-sized and all large messages. Our results suggest that the Dynamic algorithm is comparable to the Static algorithm while being independent of any tuning table, and is thus portable across platforms.