Title: Designing Scalable Out-of-core Sorting with Hybrid MPI+PGAS Programming Models
Authors: Jithin Jose, S. Potluri, H. Subramoni, Xiaoyi Lu, Khaled Hamidouche, K. Schulz, H. Sundar, D. Panda
Venue: International Conference on Partitioned Global Address Space Programming Models (PGAS 2014), October 6, 2014
DOI: https://doi.org/10.1145/2676870.2676880
Abstract: While Hadoop holds the current Sort Benchmark record, previous research has shown that MPI-based solutions can deliver similar performance. However, most existing MPI-based designs rely on two-sided communication semantics. The emerging Partitioned Global Address Space (PGAS) programming model offers a flexible way to express parallelism for data-intensive applications, yet not all portions of data analytics applications are amenable to conversion to PGAS models. In this study, we propose a novel design of the out-of-core, k-way parallel sort algorithm that takes advantage of the features of both the MPI and OpenSHMEM PGAS models. To the best of our knowledge, this is the first design of a data-intensive computing application using hybrid MPI+PGAS models. Our experimental evaluation indicates that the proposed framework outperforms an existing MPI-based design by up to 45% at 8,192 processes. It also achieves a 7X improvement over a Hadoop-based sort using the same amount of resources at 1,024 cores.

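To make the hybrid style concrete, the sketch below combines an MPI collective with OpenSHMEM one-sided puts in a single program. It is only an illustration of that combination, not the authors' out-of-core k-way sort, and it assumes a unified MPI+PGAS runtime (such as MVAPICH2-X) in which each MPI rank and OpenSHMEM PE refer to the same process.

```cpp
// Minimal sketch of the hybrid MPI + OpenSHMEM style the paper builds on,
// NOT the authors' out-of-core k-way sort. Assumes a unified runtime
// (e.g., MVAPICH2-X) where MPI rank i and OpenSHMEM PE i are the same process.
#include <mpi.h>
#include <shmem.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    shmem_init();

    int pe   = shmem_my_pe();
    int npes = shmem_n_pes();

    // Symmetric buffer that remote PEs can write into with one-sided puts.
    const int SLOT = 1024;
    long *recv_keys = (long *)shmem_malloc(npes * SLOT * sizeof(long));

    // Collective step: agree on one splitter per PE via MPI.
    long my_sample = 17 * pe;               // placeholder local sample
    long *splitters = new long[npes];
    MPI_Allgather(&my_sample, 1, MPI_LONG, splitters, 1, MPI_LONG, MPI_COMM_WORLD);

    // One-sided step: push local keys destined for a target PE directly into
    // its symmetric buffer, without the target posting a matching receive.
    long local_keys[SLOT] = {0};
    int target = (pe + 1) % npes;
    shmem_putmem(recv_keys + pe * SLOT, local_keys, SLOT * sizeof(long), target);
    shmem_barrier_all();                    // ensure all puts are visible

    delete[] splitters;
    shmem_free(recv_keys);
    shmem_finalize();
    MPI_Finalize();
    return 0;
}
```
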
Title: Asymmetric Memory Extension for OpenSHMEM
Authors: Latchesar Ionkov, Ginger Young
Venue: International Conference on Partitioned Global Address Space Programming Models (PGAS 2014), October 6, 2014
DOI: https://doi.org/10.1145/2676870.2676887
Abstract: Memory allocation in OpenSHMEM is a global operation that requires in-step calls from all Processing Elements (PEs). Although this approach works for applications that split the work evenly, it prevents the use of OpenSHMEM in cases where the workload, and the memory it uses, are allocated dynamically and can change significantly while the application is running. To broaden the cases where OpenSHMEM can be used, we propose an extension: asymmetric memory support. The extension allows PEs to allocate memory independently and to make this memory available for remote access by other PEs.

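The sketch below shows the symmetric-allocation constraint the extension is meant to relax. The abstract does not specify the extension's API, so only the standard, collective shmem_malloc appears, and the per-PE workload sizes are invented for illustration.

```cpp
// Sketch of the constraint the paper addresses, not of its proposed extension:
// standard OpenSHMEM allocation is symmetric, so every PE must issue the same
// shmem_malloc call in step, even when only some PEs actually need the memory.
#include <shmem.h>
#include <cstddef>

int main() {
    shmem_init();
    int pe = shmem_my_pe();

    // Suppose each PE's real requirement varies at run time (hypothetical workload).
    std::size_t my_need = 1024u * (std::size_t)(pe + 1);
    (void)my_need;   // cannot be used directly: symmetric allocation needs one agreed size

    // All PEs therefore over-allocate to a common upper bound (real codes would
    // first compute the global maximum, e.g. with a max-reduction).
    std::size_t agreed = 1024u * (std::size_t)shmem_n_pes();
    long *buf = (long *)shmem_malloc(agreed * sizeof(long));

    // The proposed asymmetric extension would instead let each PE allocate
    // my_need bytes independently and still expose them for remote access;
    // the abstract does not give its API, so no such call is shown here.

    shmem_free(buf);
    shmem_finalize();
    return 0;
}
```
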
Title: HPX: A Task Based Programming Model in a Global Address Space
Authors: Hartmut Kaiser, T. Heller, Bryce Adelstein-Lelbach, Adrian Serio, D. Fey
Venue: International Conference on Partitioned Global Address Space Programming Models (PGAS 2014), October 6, 2014
DOI: https://doi.org/10.1145/2676870.2676883
Abstract: The significant increase in complexity of Exascale platforms, due to energy-constrained, billion-way parallelism and major changes to processor and memory architecture, requires new energy-efficient and resilient programming techniques that are portable across multiple future generations of machines. We believe that guaranteeing adequate scalability, programmability, performance portability, resilience, and energy efficiency requires a fundamentally new approach, combined with a transition path for existing scientific applications, to fully explore the rewards of today's and tomorrow's systems. We present HPX, a parallel runtime system which extends the C++11/14 standard to facilitate distributed operations, enable fine-grained, constraint-based parallelism, and support runtime-adaptive resource management. This provides a widely accepted API enabling programmability, composability, and performance portability of user applications. By employing a global address space, we seamlessly augment the standard to apply to the distributed case. We present HPX's architecture, design decisions, and results selected from a diverse set of application runs showing superior performance, scalability, and efficiency over conventional practice.

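A minimal, local-only sketch of HPX's futurized task style follows. The distributed, global-address-space features discussed in the paper are not shown, and the header names assumed here (hpx/hpx_main.hpp, hpx/future.hpp) vary across HPX releases.

```cpp
// Minimal sketch of HPX's futurized, task-based style (local tasks only;
// the distributed/AGAS features described in the paper are not shown).
// Header layout differs across HPX releases; <hpx/hpx_main.hpp> and
// <hpx/future.hpp> are assumed here.
#include <hpx/hpx_main.hpp>
#include <hpx/future.hpp>
#include <iostream>

int square(int x) { return x * x; }

int main() {
    // hpx::async spawns a lightweight HPX task and returns a future,
    // mirroring std::async from the C++11/14 standard the runtime extends.
    hpx::future<int> f = hpx::async(square, 7);

    // Continuations express fine-grained, constraint-based parallelism:
    // the next task runs only when its input future becomes ready.
    hpx::future<int> g = f.then([](hpx::future<int> r) { return r.get() + 1; });

    std::cout << g.get() << '\n';   // prints 50
    return 0;
}
```
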
Title: Fault Tolerance for OpenSHMEM
Authors: Pengfei Hao, Pavel Shamis, Manjunath Gorentla Venkata, S. Pophale, A. Welch, S. Poole, B. Chapman
Venue: International Conference on Partitioned Global Address Space Programming Models (PGAS 2014), October 6, 2014
DOI: https://doi.org/10.1145/2676870.2676894
Abstract: On today's supercomputing systems, faults are becoming the norm rather than the exception. Given the complexity required for achieving the expected scalability and performance on future systems, this situation is expected to become worse: systems will have to function in a nearly constant presence of faults. To be productive on these systems, programming models will require both hardware and software to be resilient to faults. With the growing importance of the PGAS programming model, and of OpenSHMEM as part of the HPC software stack, the lack of a fault-tolerance model may become a liability for its users. Toward this end, in this paper we discuss the viability of using checkpoint/restart as a fault-tolerance method for OpenSHMEM, propose a selective checkpoint/restart fault-tolerance model, and discuss the challenges associated with implementing the proposed model.

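As a point of reference for the terminology, the sketch below shows a generic, application-level checkpoint of one symmetric region, one file per PE. It is not the selective checkpoint/restart model the paper proposes, whose interface is not described in the abstract.

```cpp
// Generic sketch of application-level checkpointing of a symmetric region,
// only to make the terms concrete; the paper's selective checkpoint/restart
// model and its API are not described in the abstract and are not shown.
#include <shmem.h>
#include <cstdio>
#include <cstring>

int main() {
    shmem_init();
    int pe = shmem_my_pe();

    const size_t N = 1 << 20;
    double *state = (double *)shmem_malloc(N * sizeof(double));
    std::memset(state, 0, N * sizeof(double));

    // ... compute, with remote puts/gets updating `state` ...

    shmem_barrier_all();               // quiesce communication before saving
    char path[64];
    std::snprintf(path, sizeof(path), "ckpt_pe%d.bin", pe);
    if (FILE *f = std::fopen(path, "wb")) {   // one checkpoint file per PE
        std::fwrite(state, sizeof(double), N, f);
        std::fclose(f);
    }
    shmem_barrier_all();               // all PEs reach a consistent checkpoint

    shmem_free(state);
    shmem_finalize();
    return 0;
}
```
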
Title: Affine Loop Optimization Based on Modulo Unrolling in Chapel
Authors: Aroon Sharma, Darren Smith, Joshua Koehler, R. Barua, Michael P. Ferguson
Venue: International Conference on Partitioned Global Address Space Programming Models (PGAS 2014), October 6, 2014
DOI: https://doi.org/10.1145/2676870.2676877
Abstract: This paper presents modulo unrolling without unrolling (modulo unrolling WU), a method for message aggregation for parallel loops with affine array accesses in Chapel, a Partitioned Global Address Space (PGAS) parallel programming language. Messages incur a non-trivial run-time overhead, a significant component of which is independent of the size of the message; therefore, aggregating messages improves performance. Our optimization for message aggregation is based on a technique known as modulo unrolling, pioneered by Barua [3], whose purpose was to ensure a statically predictable single tile number for each memory reference on tiled architectures such as the MIT Raw Machine [18]. Modulo unrolling WU applies to data that is distributed in a cyclic or block-cyclic manner. In this paper, we adapt the aforementioned modulo unrolling technique to the difficult problem of efficiently compiling PGAS languages to message passing architectures. When applied to loops and data distributed cyclically or block-cyclically, modulo unrolling WU can decide when to aggregate messages, thereby reducing the overall message count and runtime for a particular loop. Compared to other methods, modulo unrolling WU greatly simplifies the complex problem of automatic generation of message passing code. It also results in substantial performance improvement compared to the unoptimized Chapel compiler.

To implement this optimization in Chapel, we modify the leader and follower iterators in the Cyclic and Block Cyclic data distribution modules. We collected results comparing the performance of Chapel programs optimized with modulo unrolling WU against Chapel programs using the existing Chapel data distributions. Data collected on a ten-locale cluster show that, on average, modulo unrolling WU used with Chapel's Cyclic distribution results in 64 percent fewer messages and a 36 percent decrease in runtime for our suite of benchmarks. Similarly, modulo unrolling WU used with Chapel's Block Cyclic distribution results in 72 percent fewer messages and a 53 percent decrease in runtime.

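The aggregation idea can be restated outside Chapel: under a cyclic distribution, the elements a loop touches on a given locale sit at consecutive local offsets, so one bulk transfer per locale can replace many tiny ones. The sketch below recasts this in OpenSHMEM-style C++ purely for illustration; the paper's actual contribution is a Chapel compiler and runtime transformation.

```cpp
// Sketch of the message-aggregation idea behind modulo unrolling, recast in
// plain OpenSHMEM C++ rather than the Chapel transformation the paper
// implements. A is distributed cyclically: global element i lives on
// PE (i % npes) at local offset (i / npes).
#include <shmem.h>
#include <vector>

int main() {
    shmem_init();
    int npes = shmem_n_pes();

    const int LOCAL = 4096;
    long *A = (long *)shmem_malloc(LOCAL * sizeof(long)); // cyclic block per PE

    const int N = LOCAL * npes;          // global length
    std::vector<long> copy(N);

    // Naive access: one tiny remote get per element -> N messages.
    // for (int i = 0; i < N; ++i)
    //     shmem_getmem(&copy[i], &A[i / npes], sizeof(long), i % npes);

    // Aggregated access: after "unrolling by the modulus" npes, every element
    // with i % npes == p sits at consecutive local offsets on PE p, so one
    // bulk get per PE suffices -> npes messages instead of N.
    std::vector<long> chunk(LOCAL);
    for (int p = 0; p < npes; ++p) {
        shmem_getmem(chunk.data(), A, LOCAL * sizeof(long), p);
        for (int k = 0; k < LOCAL; ++k)
            copy[(size_t)k * npes + p] = chunk[k];
    }

    shmem_free(A);
    shmem_finalize();
    return 0;
}
```
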
Title: Early Evaluation of Scalable Fabric Interface for PGAS Programming Models
Authors: Miao Luo, Kayla Seager, K. S. Murthy, C. Archer, S. Sur, Sean Hefty
Venue: International Conference on Partitioned Global Address Space Programming Models (PGAS 2014), October 6, 2014
DOI: https://doi.org/10.1145/2676870.2676871
Abstract: Inter-processor communication is a critical factor for performance at scale. In order to achieve good performance, communication overheads should be minimized, and the fabric interface library plays a major role in determining those overheads. This is particularly important for Partitioned Global Address Space (PGAS) programming models, as these models have been designed for very low-overhead remote memory access.

The OpenFabrics Alliance has recently initiated an effort to revamp the fabric communication interface to better suit parallel programming models. The new open-source interface is called the Scalable Fabric Interface (SFI). Its chief distinguishing feature is that the new interfaces are co-designed along with the applications that use them, such as PGAS communication libraries.

In this paper we present an early evaluation of the mapping of PGAS libraries onto SFI by implementing prototypes of the popular GASNet library and of OpenSHMEM over SFI. Our analysis indicates that the overheads of mapping to SFI are significantly lower than those of mapping to the current OpenFabrics Verbs communication interface. Compared to similar mappings over OpenFabrics Verbs, we reduce the number of instructions in mapping GASNet to SFI by 82%, Berkeley UPC over GASNet to SFI by 80%, and OpenSHMEM to SFI by 95%.

Title: Scalable MiniMD Design with Hybrid MPI and OpenSHMEM
Authors: Mingzhe Li, Jian Lin, Xiaoyi Lu, Khaled Hamidouche, K. Tomko, D. Panda
Venue: International Conference on Partitioned Global Address Space Programming Models (PGAS 2014), October 6, 2014
DOI: https://doi.org/10.1145/2676870.2676893
Abstract: The MPI programming model has been widely used for scientific applications. The emergence of Partitioned Global Address Space (PGAS) programming models presents an alternative approach to improve programmability. With a global data view and lightweight communication operations, PGAS has the potential to increase the performance of scientific applications at scale. However, since PGAS models are still emerging, it is unlikely that entire applications will be re-written with them. Instead, unified communication runtimes have paved the way for a new class of hybrid applications that can leverage the benefits of both MPI and PGAS models. In this paper, we re-design an existing MPI-based scientific mini-application (MiniMD) with the MPI and OpenSHMEM programming models. We propose two alternative designs using MPI and OpenSHMEM and compare the performance and scalability of those designs with the original MPI-based implementation. Our performance evaluations using MVAPICH2-X (a unified MPI+PGAS communication runtime over InfiniBand) show a 17% reduction in total execution time compared to the existing MPI-based design at 1,024 cores.

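As an illustration of the kind of pattern such a redesign targets, the sketch below replaces a two-sided halo (ghost) exchange with an OpenSHMEM put plus a flag update. It is a generic neighbor exchange, not the authors' MiniMD code.

```cpp
// Sketch of the kind of neighbor (halo) exchange an MPI-based MiniMD might
// re-express with OpenSHMEM one-sided semantics; this illustrates the pattern,
// not the authors' MiniMD redesign.
#include <shmem.h>

int main() {
    shmem_init();
    int pe    = shmem_my_pe();
    int npes  = shmem_n_pes();
    int right = (pe + 1) % npes;

    const int GHOST = 512;
    // Symmetric buffers: a ghost region and a "data arrived" flag on every PE.
    double *ghost = (double *)shmem_malloc(GHOST * sizeof(double));
    long *flag = (long *)shmem_malloc(sizeof(long));
    *flag = 0;
    shmem_barrier_all();

    double boundary[GHOST] = {0};        // local boundary atoms to export

    // Push boundary data straight into the right neighbor's ghost buffer,
    // then raise its flag; no matching receive is posted on the target.
    shmem_putmem(ghost, boundary, GHOST * sizeof(double), right);
    shmem_fence();                       // order the put before the flag update
    shmem_long_p(flag, 1, right);

    // Wait for the left neighbor's data to land in our own ghost buffer.
    shmem_long_wait_until(flag, SHMEM_CMP_EQ, 1);

    shmem_free(flag);
    shmem_free(ghost);
    shmem_finalize();
    return 0;
}
```
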
Title: OpenCoarrays: Open-source Transport Layers Supporting Coarray Fortran Compilers
Authors: A. Fanfarillo, T. Burnus, V. Cardellini, S. Filippone, D. Nagle, D. Rouson
Venue: International Conference on Partitioned Global Address Space Programming Models (PGAS 2014), October 6, 2014
DOI: https://doi.org/10.1145/2676870.2676876
Abstract: Coarray Fortran is a set of features of the Fortran 2008 standard that make Fortran a PGAS parallel programming language. Two commercial compilers currently support coarrays: Cray and Intel. Here we present two coarray transport layers provided by the new OpenCoarrays project: one library based on MPI and the other on GASNet. We link the GNU Fortran (GFortran) compiler to either of the two OpenCoarrays implementations and present performance comparisons between executables produced by GFortran and by the Cray and Intel compilers. The comparison includes synthetic benchmarks, application prototypes, and an application kernel. In our tests, Intel outperforms GFortran only on intra-node small transfers (in particular, scalars). GFortran outperforms Intel on intra-node array transfers and in all settings that require inter-node transfers. The Cray comparisons are mixed, with either GFortran or Cray being faster depending on the chosen hardware platform, network, and transport layer.

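To make the role of a transport layer concrete, the sketch below shows roughly how a remote coarray assignment such as a(:)[img] = b(:) can be lowered onto MPI one-sided operations. It is an assumption-laden illustration written in C++ rather than Fortran; the real OpenCoarrays MPI layer is considerably more elaborate.

```cpp
// Rough sketch of how a Fortran coarray assignment such as  a(:)[img] = b(:)
// can be lowered onto MPI one-sided operations, in the spirit of an MPI-based
// transport layer; the real OpenCoarrays library differs in detail.
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int N = 1024;
    double *a = nullptr;                  // the coarray's memory on this image
    MPI_Win win;                          // one window per coarray
    MPI_Win_allocate(N * sizeof(double), sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &a, &win);

    double b[N];                          // local (non-coarray) source
    for (int i = 0; i < N; ++i) b[i] = rank + i;

    int img = (rank + 1) % size;          // target image of the assignment
    MPI_Win_fence(0, win);
    MPI_Put(b, N, MPI_DOUBLE, img, /*target displacement*/ 0, N, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);                // assignment complete on all images

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```
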
Title: A Multithreaded Communication Substrate for OpenSHMEM
Authors: Aurélien Bouteiller, T. Hérault, G. Bosilca
Venue: International Conference on Partitioned Global Address Space Programming Models (PGAS 2014), October 6, 2014
DOI: https://doi.org/10.1145/2676870.2676895
Abstract: OpenSHMEM scalability is strongly dependent on the capability of its communication layer to efficiently handle multiple threads. In this paper, we present an early evaluation of the thread safety specification in the Unified Common Communication Substrate (UCCS) employed in OpenSHMEM. Results demonstrate that thread safety can be provided at an acceptable cost and can improve efficiency for some operations, compared to serializing communication.

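The usage pattern under evaluation looks roughly like the sketch below: several application threads concurrently driving OpenSHMEM puts to a neighbor. The thread-level initialization shown uses the OpenSHMEM 1.4 shmem_init_thread interface, which postdates this paper and stands in for whatever the UCCS-based prototype exposed.

```cpp
// Sketch of the workload pattern the paper evaluates: several application
// threads driving OpenSHMEM communication concurrently. Thread-level
// initialization uses the OpenSHMEM 1.4 interface, which postdates UCCS.
#include <shmem.h>
#include <thread>
#include <vector>

int main() {
    int provided = 0;
    shmem_init_thread(SHMEM_THREAD_MULTIPLE, &provided);
    if (provided != SHMEM_THREAD_MULTIPLE) { shmem_finalize(); return 1; }

    int npes   = shmem_n_pes();
    int pe     = shmem_my_pe();
    int target = (pe + 1) % npes;

    const int NT = 4, CHUNK = 1024;
    long *dst = (long *)shmem_malloc(NT * CHUNK * sizeof(long));
    shmem_barrier_all();

    // Each thread streams its own chunk to the neighbor; with a thread-safe
    // substrate these puts can progress in parallel instead of serializing.
    std::vector<std::thread> workers;
    for (int t = 0; t < NT; ++t)
        workers.emplace_back([=] {
            std::vector<long> src(CHUNK, t);
            shmem_putmem(dst + t * CHUNK, src.data(), CHUNK * sizeof(long), target);
        });
    for (auto &w : workers) w.join();

    shmem_barrier_all();
    shmem_free(dst);
    shmem_finalize();
    return 0;
}
```
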
Title: Analysis of Energy and Performance of PGAS-based Data Access Patterns
Authors: Siddhartha Jana, Joseph Schuchart, B. Chapman
Venue: International Conference on Partitioned Global Address Space Programming Models (PGAS 2014), October 6, 2014
DOI: https://doi.org/10.1145/2676870.2676882
Abstract: One of the factors associated with the usability of distributed programming models on exascale machines is the energy and power cost of data movement across large-scale systems. PGAS implementations provide users with explicit interfaces for one-sided transfers to remote processes. However, a number of factors across the software stack can significantly impact the energy signatures of communication-intensive applications that rely on such transfers. Characteristics such as the use of non-blocking communication, the number of initiated transfers, the size of the data payload packed within each transfer, and the use of pinned-down user buffers all contribute to this impact.

In this paper, we discuss a number of RDMA-based communication patterns that are frequently incorporated within applications and communication libraries and that can significantly affect energy and performance characteristics. We present an empirical study of the potential energy savings by examining the impact on the CPU and DRAM. Since performance is a major criterion for PGAS programming models, we use the energy-delay product as a metric to justify the feasibility of these transformations.

We hope that this work motivates the incorporation of energy-based metrics for fine-tuning PGAS implementations.

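Two of the access patterns the paper varies, many fine-grained transfers versus one aggregated transfer of the same payload, can be sketched as below; the energy-measurement machinery itself is not reproduced here.

```cpp
// Sketch of two access patterns whose energy/performance behavior the paper
// studies: many fine-grained puts versus one aggregated put. Non-blocking
// variants (shmem_putmem_nbi, OpenSHMEM 1.3+) would be a third point of
// comparison but are omitted to keep the sketch short.
#include <shmem.h>
#include <vector>

int main() {
    shmem_init();
    int target = (shmem_my_pe() + 1) % shmem_n_pes();

    const int N = 4096;
    long *remote = (long *)shmem_malloc(N * sizeof(long));
    std::vector<long> local(N, 42);
    shmem_barrier_all();

    // Pattern 1: N tiny transfers; the per-message overhead (and its energy
    // cost) is paid N times.
    for (int i = 0; i < N; ++i)
        shmem_long_p(&remote[i], local[i], target);
    shmem_quiet();                       // complete all outstanding puts

    // Pattern 2: the same payload packed into a single transfer.
    shmem_putmem(remote, local.data(), N * sizeof(long), target);
    shmem_quiet();

    shmem_barrier_all();
    shmem_free(remote);
    shmem_finalize();
    return 0;
}
```
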