Title: HabaneroUPC++: a Compiler-free PGAS Library
Authors: Vivek Kumar, Yili Zheng, Vincent Cavé, Zoran Budimlic, Vivek Sarkar
DOI: 10.1145/2676870.2676879 (https://doi.org/10.1145/2676870.2676879)
Venue: International Conference on Partitioned Global Address Space Programming Models, 2014-10-06
Abstract: Partitioned Global Address Space (PGAS) programming models combine shared- and distributed-memory features, providing the basis for high-performance, high-productivity parallel programming environments. UPC++ [39] is a very recent PGAS implementation that takes a library-based approach and avoids the complexities associated with compiler transformations. However, this implementation does not support dynamic task parallelism and relies on other threading models (e.g., OpenMP or pthreads) to exploit parallelism within a PGAS place.
In this paper, we introduce a compiler-free PGAS library called HabaneroUPC++, which supports a tighter integration of intra-place and inter-place parallelism than standard hybrid programming approaches. The library makes heavy use of C++11 lambda functions in its APIs; lambdas avoid the need for compiler support while still retaining the syntactic convenience of language-based approaches. The HabaneroUPC++ implementation is based on a tight integration of the UPC++ library and the Habanero-C++ library, with new extensions to support the integration: UPC++ provides PGAS communication and function shipping using GASNet, while Habanero-C++ provides intra-place work stealing integrated with function shipping. We demonstrate the programmability and performance of our implementation using two benchmarks, scaled up to 6K cores. The insights developed in this paper promise to further enhance the usability and popularity of PGAS programming models.
Title: Development and Extension of Atomic Memory Operations in OpenSHMEM
Authors: Pavel Shamis, Manjunath Gorentla Venkata, S. Poole, S. Pophale, Mike Dubman, R. Graham, Dror Goldenberg, G. Shainer
DOI: 10.1145/2676870.2676891 (https://doi.org/10.1145/2676870.2676891)
Venue: International Conference on Partitioned Global Address Space Programming Models, 2014-10-06
Abstract: A distinguishing characteristic of OpenSHMEM compared to other PGAS programming model implementations is its support for atomic memory operations (AMOs), with a rich set of interfaces for 32-bit and 64-bit datatypes. On most modern networks, network-implemented AMOs are known to outperform software-implemented AMOs, so a high-performance OpenSHMEM implementation should offload AMOs to the underlying network hardware when possible. Challenges arise, however, when (a) the underlying hardware does not support the full set of atomic operations, (b) more than one device is used, or (c) heterogeneous systems with multiple types of devices are involved. In this paper, we analyze these challenges and discuss potential solutions to address them.
Title: Evaluation of PGAS Communication Paradigms with Geometric Multigrid
Authors: H. Shan, A. Kamil, Samuel Williams, Yili Zheng, K. Yelick
DOI: 10.1145/2676870.2676874 (https://doi.org/10.1145/2676870.2676874)
Venue: International Conference on Partitioned Global Address Space Programming Models, 2014-10-06
Abstract: Partitioned Global Address Space (PGAS) languages and one-sided communication enable application developers to select the communication paradigm that balances the performance needs of applications with the productivity desires of programmers. In this paper, we evaluate three different one-sided communication paradigms in the context of geometric multigrid using the miniGMG benchmark. Although miniGMG's static, regular, and predictable communication does not exploit the ultimate potential of PGAS models, multigrid solvers appear in many contemporary applications and represent one of the most important communication patterns. We use UPC++, a PGAS extension of C++, as the vehicle for our evaluation, though our work is applicable to any of the existing PGAS languages and models. We compare performance with a highly tuned MPI baseline, and the results indicate that the most promising approach toward achieving both performance and ease of programming is to use high-level abstractions, such as the multidimensional arrays provided by UPC++, that hide data aggregation and messaging in the runtime library.
Title: One-Sided Append: A New Communication Paradigm For PGAS Models
Authors: James Dinan, Mario Flajslik
DOI: 10.1145/2676870.2676886 (https://doi.org/10.1145/2676870.2676886)
Venue: International Conference on Partitioned Global Address Space Programming Models, 2014-10-06
Abstract: One-sided append represents a new class of one-sided operations that can be used to aggregate messages from multiple communication sources into a single destination buffer. This new communication paradigm is analyzed in terms of its impact on the OpenSHMEM parallel programming model and applications. Implementation considerations are discussed and an accelerated implementation using the Portals 4 networking API is presented. Initial experimental results with the NAS integer sort benchmark indicate that this new operation can significantly improve the communication performance of such applications.
Title: Hiding latency in Coarray Fortran 2.0
Authors: William N. Scherer, L. Adhianto, G. Jin, J. Mellor-Crummey, Chaoran Yang
DOI: 10.1145/2020373.2020387 (https://doi.org/10.1145/2020373.2020387)
Venue: International Conference on Partitioned Global Address Space Programming Models, 2010-10-12
Abstract: In Numrich and Reid's 1998 proposal [17], Coarray Fortran is a simple set of extensions to Fortran 95, principal among which is support for shared data known as coarrays. Responding to shortcomings in the Fortran Standards Committee's addition of coarrays to the Fortran 2008 standard, we at Rice envisioned an extensive update which has come to be known as Coarray Fortran 2.0 [15]. In this paper, we chronicle the evolution of Coarray Fortran 2.0 as it gains support for asynchronous point-to-point and collective operations. We outline how these operations are implemented and describe code fragments from several benchmark programs to show how we use these operations to hide latency by overlapping communication and computation.
Title: An open-source compiler and runtime implementation for Coarray Fortran
Authors: Deepak Eachempati, H. Jun, B. Chapman
DOI: 10.1145/2020373.2020386 (https://doi.org/10.1145/2020373.2020386)
Venue: International Conference on Partitioned Global Address Space Programming Models, 2010-10-12
Abstract: Coarray Fortran (CAF) comprises a set of proposed language extensions to Fortran that are expected to be adopted as part of the Fortran 2008 standard. In contrast to prior open-source implementation efforts, our approach is to use a single, unified compiler infrastructure to translate, optimize, and generate binaries from CAF codes. In this paper, we describe our compiler and runtime implementation of CAF using an Open64-based compiler infrastructure. We detail how we generate a high-level intermediate representation from the CAF code in our compiler's front end, how our compiler analyzes and translates this IR to generate a binary that makes use of our runtime system, and how we support the runtime execution model with our runtime library. We have carried out experiments using both ARMCI- and GASNet-based runtime implementations, and we present these results.
Title: Performance modeling for multilevel communication in SHMEM+
Authors: V. Aggarwal, C. Yoon, A. George, H. Lam, G. Stitt
DOI: 10.1145/2020373.2020380 (https://doi.org/10.1145/2020373.2020380)
Venue: International Conference on Partitioned Global Address Space Programming Models, 2010-10-12
Abstract: The field of high-performance computing (HPC) is currently undergoing a major transformation brought upon by a variety of new processor device technologies. Accelerator devices (e.g., FPGA, GPU) are becoming increasingly popular as coprocessors in HPC, embedded, and other systems, improving application performance while in some cases also reducing energy consumption. The presence of such devices introduces additional levels of communication and memory hierarchy in the system, which warrants an expansion of conventional parallel-programming practices to address these differences. Programming models and libraries for heterogeneous, parallel, and reconfigurable computing such as SHMEM+ have been developed to support communication and coordination involving a diverse mix of processor devices. However, to evaluate the impact of communication on application performance and obtain optimal performance, a concrete understanding of the underlying communication infrastructure is often imperative. In this paper, we introduce a new multilevel communication model for representing various data transfers encountered in these systems and for predicting performance. Three use cases are presented and evaluated. First, the model enables application developers to perform early design-space exploration of communication patterns in their applications before undertaking the laborious and expensive process of implementation, yielding improved performance and productivity. Second, the model enables system developers to quickly optimize performance of data-transfer routines within tools such as SHMEM+ when being ported to a new platform. Third, the model augments tools such as SHMEM+ to automatically improve performance of data transfers by self-tuning internal parameters to match platform capabilities. Results from experiments with these use cases suggest marked improvement in performance, productivity, and portability.
Title: Improving UPC productivity via integrated development tools
Authors: Max Billingsley, Beth Tibbitts, A. George
DOI: 10.1145/2020373.2020381 (https://doi.org/10.1145/2020373.2020381)
Venue: International Conference on Partitioned Global Address Space Programming Models, 2010-10-12
Abstract: In the world of high-performance computing (HPC), there has been an increased focus in recent years upon the importance of productivity in HPC application development. One crucial aspect of productivity is the programming model used, and the family of partitioned global-address-space (PGAS) models, such as UPC and X10, has served to advance the state of the art in balancing performance and productivity. Also of great importance is the variety of development tools used to support activities such as editing, debugging, and optimizing programs. These tools are often most useful as part of an integrated development environment (IDE). While some progress has been made towards bringing IDE capabilities into the HPC world, in particular by way of Eclipse projects, support has mainly focused on MPI and OpenMP tools.
In this paper, we present research and development activities that are bringing Eclipse-based IDE capabilities to the PGAS developer community. We focus on tools for UPC, giving background on previously existing capabilities to work with UPC programs in Eclipse and then presenting a tool-chain and project wizard for the open-source Berkeley UPC compiler, basic UPC static analysis tools, and integration of our performance analysis tool (Parallel Performance Wizard) supporting UPC. Finally, we conclude by proposing future work and providing recommendations for further integration of UPC and other PGAS tools to enhance overall developer productivity.
Title: Hybrid PGAS runtime support for multicore nodes
Authors: F. Blagojevic, Paul H. Hargrove, Costin Iancu, K. Yelick
DOI: 10.1145/2020373.2020376 (https://doi.org/10.1145/2020373.2020376)
Venue: International Conference on Partitioned Global Address Space Programming Models, 2010-10-12
Abstract: With multicore processors as the standard building block for high performance systems, parallel runtime systems need to provide excellent performance on shared memory, distributed memory, and hybrids. Conventional wisdom suggests that threads should be used as the runtime mechanism within shared memory, and two runtime versions for shared and distributed memory are often designed and implemented separately, retrofitting after the fact for hybrid systems. In this paper we consider the problem of implementing a runtime layer for Partitioned Global Address Space (PGAS) languages, which offer a uniform programming abstraction for hybrid machines. We present a new process-based shared memory runtime and compare it to our previous pthreads implementation. Both are integrated with the GASNet communication layer, and they can co-exist with one another. We evaluate the shared memory runtime approaches, showing that they interact in important and sometimes surprising ways with the communication layer. Using a set of microbenchmarks and application-level benchmarks on an IBM BG/P, Cray XT, and InfiniBand cluster, we show that threads, processes, and combinations of both are needed for maximum performance. Our new runtime shows speedups of over 60% for application benchmarks and 100% for collective communication benchmarks, when compared to the previous implementation. Our work primarily targets PGAS languages, but some of the lessons are relevant to other parallel runtime systems and libraries.
Title: Asynchronous PGAS runtime for Myrinet networks
Authors: Montse Farreras, G. Almási
DOI: 10.1145/2020373.2020377 (https://doi.org/10.1145/2020373.2020377)
Venue: International Conference on Partitioned Global Address Space Programming Models, 2010-10-12
Abstract: PGAS languages aim to enhance productivity for large-scale systems. The IBM Asynchronous PGAS runtime (APGAS) supports various high-productivity programming languages including UPC, X10, and CAF. The runtime has been designed for scalability and performance portability, and it includes optimized implementations for the LAPI and Blue Gene DCMF communication subsystems.
This paper presents an optimized implementation of the IBM APGAS runtime for Myrinet networks, on top of the MX communication library. It explains the challenges of implementing a one-sided communication model (APGAS) on top of a two-sided communication API such as MX.
We show that our implementation outperforms the Berkeley GASNet runtime in terms of latency and bandwidth. We also demonstrate scalability of various HPC benchmarks up to 1024 processes.