Higher-level parallelization for local and distributed asynchronous task-based programming
Hartmut Kaiser, T. Heller, Daniel Bourgeois, D. Fey
ESPM '15, 2015-11-15. DOI: 10.1145/2832241.2832244

Abstract: One of the biggest challenges on the way to exascale computing is programmability in the context of performance portability. Efficiently utilizing the prospective architectures of exascale supercomputers will be challenging in many ways, not least because of a massive increase in on-node parallelism and the growing complexity of memory hierarchies. Parallel programming models need to be able to formulate algorithms that exploit these architectural peculiarities. The recent revival of interest in C++ across industry and the wider community has spurred a remarkable number of standardization proposals and technical specifications. Among these efforts is the development of facilities for seamlessly integrating various types of parallelism, such as iterative parallel execution, task-based parallelism, asynchronous execution flows, continuation-style computation, and explicit fork-join control flow of independent and non-homogeneous code paths. These proposals form the foundation of a powerful high-level abstraction that allows C++ codes to deal with the ever-increasing architectural complexity of recent hardware.

In this paper, we present the results of developing these higher-level parallelization facilities in HPX, a general-purpose C++ runtime system for applications of any scale. The APIs have been designed to overcome the limitations of the programming models prevalently used in C++ codes today. HPX exposes a uniform higher-level API that gives the application programmer syntactic and semantic equivalence across various types of on-node and off-node parallelism, all of which are well integrated into the C++ type system. We show that these higher-level facilities, which are fully aligned with modern C++ programming concepts, are easily extensible, fully generic, and enable highly efficient parallelization on par with or better than existing equivalent applications based on OpenMP and/or MPI.

Fault tolerance features of a new multi-SPMD programming/execution environment
Miwako Tsuji, S. Petiton, M. Sato
ESPM '15, 2015-11-15. DOI: 10.1145/2832241.2832243

Abstract: Supercomputers in the exascale era will consist of a huge number of nodes arranged in a multi-level hierarchy. Exploiting such systems raises many important challenges, including scalability, programmability, reliability, and energy efficiency. In previous work, we focused on scalability and programmability and proposed FP2C (Framework for Post-Petascale Computing), a development and execution environment for parallel programming based on workflow and PGAS (Partitioned Global Address Space) programming models. In this paper, we focus on reliability. We extend FP2C by adding a fault-detection capability to its middleware and by incorporating a fault-resilience scheduling policy into the workflow scheduler. With the extended FP2C, fault tolerance can be achieved without modifying applications.

PPL: an abstract runtime system for hybrid parallel programming
Alex Brooks, Hoang-Vu Dang, Nikoli Dryden, M. Snir
ESPM '15, 2015-11-15. DOI: 10.1145/2832241.2832246

Abstract: Hardware trends indicate that supercomputers will see fast-growing intra-node parallelism. Future programming models will need to carefully manage the interaction between inter- and intra-node parallelism to cope with this evolution. Many programming models expose both levels of parallelism, but they do not scale well as per-node thread counts rise, and the limited interoperability between threading and communication leads to unnecessary software overheads and an increased amount of unnecessary communication. To address this, it is necessary to understand the limitations of current models and to develop new approaches.

We propose a new runtime system design, PPL, which abstracts the important high-level concepts of a typical parallel system for distributed-memory machines. By modularizing these elements, individual layers can be tested to better understand the needs of future programming models. We present details of the design and implementation of PPL in C++11 and evaluate the performance of several different module implementations through micro-benchmarks and three applications: Barnes-Hut, Monte Carlo particle tracking, and a sparse triangular solver.

The scalable petascale data-driven approach for the Cholesky factorization with multiple GPUs
Yuki Tsujita, Toshio Endo, K. Fujisawa
ESPM '15, 2015-11-15. DOI: 10.1145/2832241.2832245

Abstract: The Cholesky factorization is an important linear algebra kernel used in solving semidefinite programming (SDP) problems. However, the large computational cost of the Cholesky factorization of the Schur complement matrix (SCM) has been an obstacle to solving large-scale problems. This paper describes a new version of the parallel SDP solver SDPARA, equipped with a Cholesky factorization implementation that demonstrated 1.7 PFlops performance on a problem with over two million constraints using 4,080 GPUs. Performance and scalability are further improved by introducing a data-driven approach in place of the traditional synchronous approach. We also point out that typical data-driven implementations are limited in scalability, and we demonstrate the efficiency of the proposed approach via experiments on the TSUBAME2.5 supercomputer.

ACPdl: data-structure and global memory allocator library over a thin PGAS-layer
Yuichiro Ajima, Takafumi Nose, K. Saga, Naoyuki Shida, S. Sumimoto
ESPM '15, 2015-11-15. DOI: 10.1145/2832241.2832242

Abstract: HPC systems comprise an increasing number of processor cores as the exascale computing era approaches. As the number of parallel processes on a system grows, the number of point-to-point connections per process grows with it, and the memory consumed by these connections becomes an issue. A new communication library called Advanced Communication Primitives (ACP) is being developed to address this issue by providing communication functions based on the Partitioned Global Address Space (PGAS) model, which is potentially connection-less. The ACP library is designed to underlie domain-specific languages and parallel language runtimes. The ACP basic layer (ACPbl) comprises a minimal set of functions that abstract interconnect devices and provide an address translation mechanism. With ACPbl alone, global addresses can be granted only to local memory. In this paper, a new set of functions called the ACP data library (ACPdl), comprising a global memory allocator and a data-structure library, is introduced to improve the productivity of the ACP library. The global memory allocator allocates a memory region on a remote process and assigns a global address to it without involving the remote process. The data-structure library uses the global memory allocator internally and provides functions to create, read, update, and delete distributed data structures. Evaluation results for the global memory allocator and the associative-array data-structure functions show that the overhead between the main and communication threads may become a bottleneck when an implementation of ACPbl uses a low-latency HPC-dedicated interconnect device.

Task characterization-driven scheduling of multiple applications in a task-based runtime
K. Chandrasekar, B. Seshasayee, Ada Gavrilovska, K. Schwan
ESPM '15, 2015-11-15. DOI: 10.1145/2832241.2832248

Abstract: Task-based runtimes like OCR, X10, and Charm++ promise to address scalability challenges on exascale machines thanks to their fine-grained parallelism, inherent asynchrony, and consequent efficient localized synchronization. Although such runtimes are typically used to run a single application at a time, a common HPC scenario involves running a producer simulation application co-located with a consumer analytics application to reduce data-movement costs. The potentially diverse requirements of such co-located applications challenge the runtime's ability to efficiently manage underlying resources, maintain application performance, and minimize the effects of sharing on application progress. To address this, we implement and study techniques based on application task characterization to improve resource utilization in shared task-based runtimes. We demonstrate that, by maintaining task characteristics such as compute, cache, or memory intensity, obtained through offline and/or online methods, we can improve a task-based runtime's ability to schedule and place tasks so as to minimize resource contention between co-running applications. Results are obtained via experimentation with the Open Community Runtime (OCR) on two distinct platforms: an x86-based machine and a research platform based on the experimental Traleika Glacier (TG) architecture. On x86 we see a performance improvement of 15%, and on TG we observe a reduction in energy usage of more than 50%, illustrating the potential benefits of the approach for next-generation exascale platforms.

Hyper-Q aware intranode MPI collectives on the GPU
Iman Faraji, A. Afsahi
ESPM '15, 2015-11-15. DOI: 10.1145/2832241.2832247

Abstract: In GPU clusters, high GPU utilization and efficient communication play an important role in the performance of MPI applications. To improve GPU utilization, NVIDIA has introduced the Multi-Process Service (MPS), eliminating the context-switching overhead among processes accessing the GPU and allowing multiple intranode processes to further overlap their CUDA tasks on the GPU and potentially share its resources through the Hyper-Q feature. Prior to MPS, Hyper-Q could only provide such resource sharing within a single process. In this paper, we evaluate the effect of the MPS service on GPU communication, focusing on CUDA IPC and host-staged copies. We provide evidence that the MPS service is beneficial when multiple interprocess communications use these copy types. However, we show that careful design decisions are required to further harness the potential of this service. To this end, we propose a Static and a Dynamic algorithm that can be applied to various intranode MPI collective operations, and as a test case we provide results for the MPI_Allreduce operation. Both approaches, while following different algorithms, use a combination of host-staged and CUDA IPC copies for the interprocess communication in their collective designs. By selecting the right number and types of copies, our algorithms efficiently leverage the MPS and Hyper-Q features and improve over MVAPICH2 and MVAPICH2-GDR for most medium-sized and all large messages. Our results suggest that the Dynamic algorithm is comparable to the Static algorithm while being independent of any tuning table, and is thus portable across platforms.