{"title":"OpenSHMEM Non-blocking Data Movement Operations with MVAPICH2-X: Early Experiences","authors":"Khaled Hamidouche, Jie Zhang, D. Panda, K. Tomko","doi":"10.1109/PAW.2016.7","DOIUrl":"https://doi.org/10.1109/PAW.2016.7","url":null,"abstract":"PGAS models, with their lightweight synchronization and shared memory abstraction, are seen as a good alternative to the Message Passing model for irregular communication patterns. OpenSHMEM is a library-based PGAS model. OpenSHMEM 1.3 introduced Non-Blocking data movement operations to provide better asynchronous progress and overlap. In this paper, we present our experiences in designing Non-Blocking Put and Get operations on InfiniBand systems. Using the MVAPICH2-X runtime, we present alternative designs for intra-node and inter-node operations. We also present a set of new benchmarks to analyze the latency, message rate performance, and communication/computation overlap benefits. The performance evaluation shows a 7X improvement in the message rate. Furthermore, using a 3D-Stencil based application kernel, we assess the benefits of the OpenSHMEM Non-Blocking extensions. We show 50% and 28% improvement on 27 and 64 processes, respectively.","PeriodicalId":383847,"journal":{"name":"2016 PGAS Applications Workshop (PAW)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127556647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Application of PGAS Programming to Power Grid Simulation","authors":"B. Palmer","doi":"10.1109/PAW.2016.10","DOIUrl":"https://doi.org/10.1109/PAW.2016.10","url":null,"abstract":"This paper will describe the application of the PGAS Global Arrays (GA) library to power grid simulations. The GridPACK™ framework has been designed to enable power grid engineers to develop parallel simulations of the power grid by providing a set of templates and libraries that encapsulate most of the details of parallel programming in higher level abstractions. The communication portions of the framework are implemented using a combination of message-passing (MPI) and one-sided communication (GA). This paper will provide a brief overview of GA and describe in detail the implementation of collective hash tables, which are used in many power grid applications to match data with a previously distributed network.","PeriodicalId":383847,"journal":{"name":"2016 PGAS Applications Workshop (PAW)","volume":"279 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114107368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing PGAS Overhead in a Multi-locale Chapel Implementation of CoMD","authors":"Riyaz Haque, D. Richards","doi":"10.1109/PAW.2016.9","DOIUrl":"https://doi.org/10.1109/PAW.2016.9","url":null,"abstract":"Chapel supports distributed computing with an underlying PGAS memory address space. While it provides abstractions for writing simple and elegant distributed code, the type system currently lacks a notion of locality, i.e., a description of an object's access behavior in relation to its actual location. This often necessitates programmer intervention to avoid redundant non-local data access. Moreover, due to insufficient locality information, the compiler ends up using “wide” pointers—that can point to non-local data—for objects referenced in an otherwise completely local manner, adding to the runtime overhead. In this work we describe CoMD-Chapel, our distributed Chapel implementation of the CoMD benchmark. We demonstrate that optimizing data access through replication and localization is crucial for achieving performance comparable to the reference implementation. We discuss limitations of existing scope-based locality optimizations and argue instead for a more general (and robust) type-based approach. Lastly, we evaluate code performance and scaling characteristics. The fully optimized version of CoMD-Chapel performs within 62%–87% of the reference implementation.","PeriodicalId":383847,"journal":{"name":"2016 PGAS Applications Workshop (PAW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115797313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-scale CAFE Framework for Simulating Fracture in Heterogeneous Materials Implemented in Fortran Co-arrays and MPI","authors":"A. Shterenlikht, L. Margetts, J. D. Arregui-Mena, L. Cebamanos","doi":"10.1109/PAW.2016.6","DOIUrl":"https://doi.org/10.1109/PAW.2016.6","url":null,"abstract":"Fortran coarrays have been used as an extension to the standard for over 20 years, mostly on Cray systems. Their appeal to users increased substantially when they were standardised in 2010. In this work we show that coarrays offer simple and intuitive data structures for 3D cellular automata (CA) modelling of material microstructures. We show how coarrays can be used together with an MPI finite element (FE) library to create a two-way concurrent hierarchical and scalable multi-scale CAFE deformation and fracture framework. Design of a coarray cellular automata microstructure evolution library, CGPACK, is described. A highly portable MPI FE library, ParaFEM, was used in this work. We show that, independently, CGPACK and ParaFEM programs can scale up well into tens of thousands of cores. Strong scaling of a hybrid ParaFEM/CGPACK MPI/coarray multi-scale framework was measured on a practical solid mechanics example: fracture of a steel round bar under tension. That program did not scale beyond 7 thousand cores. Excessive synchronisation might be one contributing factor to the relatively poor scaling. Therefore we conclude with a comparative analysis of synchronisation requirements in MPI and coarray programs. Specific challenges of synchronising a coarray library are discussed.","PeriodicalId":383847,"journal":{"name":"2016 PGAS Applications Workshop (PAW)","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116525317","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Experiences of Applying One-Sided Communication to Nearest-Neighbor Communication","authors":"H. Shan, Samuel Williams, Yili Zheng, Weiqun Zhang, Bei Wang, S. Ethier, Zhengji Zhao","doi":"10.1109/PAW.2016.8","DOIUrl":"https://doi.org/10.1109/PAW.2016.8","url":null,"abstract":"Nearest-neighbor communication is one of the most important communication patterns appearing in many scientific applications. In this paper, we discuss the results of applying UPC++, a library-based partitioned global address space (PGAS) programming extension to C++, to an adaptive mesh framework (BoxLib) and a full scientific application, GTC-P, whose communications are dominated by nearest-neighbor communication. The results on a Cray XC40 system show that, compared with the highly-tuned MPI two-sided implementations, UPC++ improves the communication performance by up to 60% and 90% for BoxLib and GTC-P, respectively. We also implement the nearest-neighbor communication using MPI one-sided messages. The performance comparison demonstrates that the MPI one-sided implementation can also improve the communication performance over the two-sided version, though not as significantly as UPC++ does.","PeriodicalId":383847,"journal":{"name":"2016 PGAS Applications Workshop (PAW)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133537935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}