ACM SIGPLAN Symposium on Scala: Latest Publications

Using HPX and LibGeoDecomp for scaling HPC applications on heterogeneous supercomputers
ACM SIGPLAN Symposium on Scala. Pub Date: 2013-11-17. DOI: 10.1145/2530268.2530269
T. Heller, Hartmut Kaiser, Andreas Schäfer, D. Fey
Abstract: With the general availability of PetaFLOP clusters and the advent of heterogeneous machines equipped with special accelerator cards such as the Xeon Phi [2], computer scientists face the difficult task of improving application scalability beyond what is possible with conventional techniques and programming models today. In addition, the need for highly adaptive runtime algorithms, and for applications handling highly inhomogeneous data, further impedes our ability to efficiently write code which performs and scales well.
In this paper we present the advantages of using HPX [19, 3, 29], a general-purpose parallel runtime system for applications of any scale, as a backend for LibGeoDecomp [25] to implement a three-dimensional N-body simulation with local interactions, and we compare scaling and performance results for this application under the HPX and MPI backends. LibGeoDecomp is a library for geometric decomposition codes built around a user-supplied simulation model, where the library handles the spatial and temporal loops and the data storage.
The presented results were acquired from various homogeneous and heterogeneous runs on up to 1024 nodes (16384 conventional cores) combined with up to 16 Xeon Phi accelerators (3856 hardware threads) on TACC's Stampede supercomputer [1]. In the configuration using the HPX backend, more than 0.35 PFLOPS were achieved, corresponding to a parallel application efficiency of around 79%. Our measurements demonstrate the advantage of the intrinsically asynchronous, message-driven programming model exposed by HPX, which enables better latency hiding, fine- to medium-grain parallelism, and constraint-based synchronization. HPX's uniform programming model simplifies writing highly parallel code for heterogeneous resources.
Citations: 37
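The future-based, constraint-driven execution style the abstract attributes to HPX (in C++, via `hpx::async` and `hpx::dataflow`) can be illustrated with a short Python stand-in. This sketch is not the HPX API: `dataflow` here is a hypothetical helper built on `concurrent.futures`, showing only the idea that independent work starts eagerly and dependent tasks fire once their input futures resolve.

```python
from concurrent.futures import ThreadPoolExecutor

def dataflow(pool, fn, *deps):
    # Run fn once every dependency future has resolved
    # (constraint-based synchronization, in the spirit of hpx::dataflow).
    return pool.submit(lambda: fn(*(d.result() for d in deps)))

def demo():
    with ThreadPoolExecutor(max_workers=4) as pool:
        a = pool.submit(lambda: 2)                    # independent work, starts immediately
        b = pool.submit(lambda: 3)
        c = dataflow(pool, lambda x, y: x + y, a, b)  # waits only on a and b
        d = dataflow(pool, lambda x: 10 * x, c)       # chained dependency
        return d.result()
```

Because synchronization is expressed per dependency rather than as a global barrier, unrelated work can overlap with communication latency, which is the latency-hiding effect the abstract describes.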
On scalability behaviour of Monte Carlo sparse approximate inverse for matrix computations
ACM SIGPLAN Symposium on Scala. Pub Date: 2013-11-17. DOI: 10.1145/2530268.2530274
J. Strassburg, V. Alexandrov
Abstract: This paper presents a Monte Carlo SPAI preconditioner. In contrast to the standard deterministic SPAI preconditioners that use the Frobenius norm, a Monte Carlo alternative is given that relies on Markov Chain Monte Carlo (MCMC) methods to compute a rough matrix inverse (MI). Monte Carlo methods enable a quick rough estimate of the non-zero elements of the inverse matrix with a given precision and certain probability. The advantages of this method are that the same approach applies to sparse and dense matrices, and that the complexity of the Monte Carlo matrix inversion is linear in the size of the matrix. The behaviour of the proposed algorithm is studied, its performance is investigated, and a comparison is made with the standard deterministic SPAI as well as with the optimized and parallel MSPAI version. Furthermore, Monte Carlo SPAI and MSPAI are used for solving systems of linear algebraic equations (SLAE) with BiCGSTAB, and the results are compared.
Citations: 8
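The MCMC rough-inverse idea can be illustrated with the classic Ulam-von Neumann random-walk estimator, given here as a simplified stand-in rather than the paper's actual algorithm: for A = I - B with a convergent Neumann series, each entry of A^{-1} = I + B + B^2 + ... is estimated by averaging importance-weighted random walks, with cost per sample independent of the rest of the matrix.

```python
import random

def mc_inverse_entry(B, i, j, walks=200_000, p_stop=0.5, seed=1):
    # Estimate (I - B)^{-1}[i][j] = sum_k (B^k)[i][j] by averaging
    # importance-weighted random walks (Ulam-von Neumann scheme).
    rng = random.Random(seed)
    n = len(B)
    total = 0.0
    for _ in range(walks):
        state, weight = i, 1.0
        if state == j:
            total += weight                      # k = 0 term of the series
        while rng.random() >= p_stop:            # survive a step with prob 1 - p_stop
            nxt = rng.randrange(n)               # uniform proposal over columns
            weight *= B[state][nxt] / ((1 - p_stop) / n)  # importance correction
            state = nxt
            if state == j:
                total += weight                  # contribution of the k-step term
    return total / walks
```

For B = [[0.1, 0.2], [0.2, 0.1]], the exact inverse of I - B is (1/0.77) * [[0.9, 0.2], [0.2, 0.9]], and the estimator lands within a few multiples of its standard error of each entry.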
Robust distributed orthogonalization based on randomized aggregation
ACM SIGPLAN Symposium on Scala. Pub Date: 2011-11-14. DOI: 10.1145/2133173.2133177
W. Gansterer, Gerhard Niederbrucker, H. Straková, Stefan Schulze Grotthoff
Abstract: The construction of distributed algorithms for matrix computations built on top of distributed data aggregation algorithms with randomized communication schedules is investigated. For this purpose, a new aggregation algorithm for summing or averaging distributed values, the push-flow algorithm, is developed, which achieves superior resilience to node failures compared to existing aggregation methods. On a hypercube topology it asymptotically requires the same number of iterations as the optimal all-to-all reduction operation, and it scales well with the number of nodes. Orthogonalization is studied as a prototypical matrix computation task. A new fault-tolerant distributed orthogonalization method (rdmGS), which can produce accurate results even in the presence of node failures, is built on top of distributed data aggregation algorithms.
Citations: 10
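The randomized-aggregation primitive that push-flow builds on can be illustrated with plain push-sum gossip averaging (Kempe et al.); the sketch below is that simpler predecessor, not the paper's push-flow algorithm, and it simulates all nodes in one process: every node keeps a (sum, weight) pair, pushes half to a random peer each round, and its local ratio s/w converges to the global mean with no central coordinator.

```python
import random

def push_sum(values, rounds=100, seed=0):
    # Gossip averaging: each node keeps half of its (s, w) pair and pushes
    # the other half to a uniformly random peer each round.  Mass is
    # conserved, and every node's estimate s/w converges to the mean.
    rng = random.Random(seed)
    n = len(values)
    s = [float(v) for v in values]
    w = [1.0] * n
    for _ in range(rounds):
        new_s, new_w = [0.0] * n, [0.0] * n
        for i in range(n):
            peer = rng.randrange(n)
            new_s[i] += s[i] / 2
            new_w[i] += w[i] / 2
            new_s[peer] += s[i] / 2
            new_w[peer] += w[i] / 2
        s, w = new_s, new_w
    return [si / wi for si, wi in zip(s, w)]
```

The fault-resilience claim of push-flow concerns what happens when such messages are lost or nodes fail mid-protocol, which this idealized sketch does not model.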
On non-blocking collectives in 3D FFTs
ACM SIGPLAN Symposium on Scala. Pub Date: 2011-11-14. DOI: 10.1145/2133173.2133180
R. S. Saksena
Abstract: With the inclusion of non-blocking global collective operations in the MPI 3.0 draft specification, many fundamental algorithms, such as those for performing three-dimensional (3D) FFTs, will be modified to take advantage of non-blocking collectives. Novel modifications to such fundamental algorithms will need to be suitable for incorporation into general-purpose FFT libraries routinely used by HPC application users. Here we present such a general-purpose algorithmic strategy for utilizing non-blocking collective communications in the calculation of a single parallel 3D FFT. In this scheme, the global collective communication is partitioned into blocking and non-blocking components such that overlap between communication and computation is obtained in the 3D FFT calculation. We present benchmarks of our scheme for overlapping computation and communication in the calculation of single-variable 3D FFTs on two different architectures: (a) HECToR, a Cray XE6 machine, and (b) a Fujitsu PRIMERGY Intel Westmere cluster with an InfiniBand interconnect.
Citations: 4
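The overlap pattern can be sketched with a background thread standing in for a non-blocking exchange (the actual MPI calls, e.g. `MPI_Ialltoall`, are omitted): while slab k is being transformed locally, the "exchange" that fetches slab k+1 is already in flight. The slab decomposition and the toy DFT kernel below are illustrative assumptions, not the paper's scheme.

```python
import cmath
from concurrent.futures import ThreadPoolExecutor

def dft(x):
    # O(n^2) reference DFT; a stand-in for a tuned 1-D FFT kernel.
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def pipelined_slab_transform(slabs):
    # Pipelining: complete exchange k, immediately start exchange k+1,
    # then compute on slab k while exchange k+1 proceeds in background.
    with ThreadPoolExecutor(max_workers=1) as comm:
        out = []
        inflight = comm.submit(list, slabs[0])              # first "exchange"
        for k in range(len(slabs)):
            data = inflight.result()                        # wait for exchange k
            if k + 1 < len(slabs):
                inflight = comm.submit(list, slabs[k + 1])  # exchange k+1 in flight
            out.append(dft(data))                           # compute overlaps comm
        return out
```

The pipelined result is identical to transforming each slab sequentially; the benefit is purely in hiding communication time behind the per-slab compute.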
The low-power architecture approach towards exascale computing
ACM SIGPLAN Symposium on Scala. Pub Date: 2011-11-14. DOI: 10.1145/2133173.2133175
Nikola Rajovic, Nikola Puzovic, L. Vilanova, Carlos Villavieja, Alex Ramírez
Abstract: Energy efficiency is a first-order concern when deploying any computer system. From battery-operated mobile devices to data centers and supercomputers, energy consumption limits the performance that can be offered.
We are exploring an alternative to current supercomputers that builds on small, energy-efficient mobile processors. We present results from a prototype system based on the ARM Cortex-A9 and make projections about the possibilities for increasing energy efficiency.
Citations: 82
Soft error resilient QR factorization for hybrid system with GPGPU
ACM SIGPLAN Symposium on Scala. Pub Date: 2011-11-14. DOI: 10.1145/2133173.2133179
Peng Du, P. Luszczek, S. Tomov, J. Dongarra
Abstract: General-purpose graphics processing units (GPGPUs) are increasingly deployed for scientific computing due to their performance advantages over CPUs. As a result, fault tolerance has become a more serious concern than in the period when GPGPUs were used exclusively for graphics applications. Using GPUs and CPUs together in a hybrid computing system increases flexibility and performance, but also increases the possibility of the computations being affected by soft errors. In this work, we propose a soft error resilient algorithm for QR factorization on such hybrid systems. Our contributions include (1) a checkpointing and recovery mechanism for the left factor Q whose performance is scalable on hybrid systems; (2) optimized Givens rotation utilities on GPGPUs to efficiently reduce an upper Hessenberg matrix to upper triangular form for the protection of the right factor R; and (3) a recovery algorithm based on QR update on GPGPUs. Experimental results show that our fault tolerant QR factorization can successfully detect and recover from soft errors in the entire matrix with little overhead on hybrid systems with GPGPUs.
Citations: 36
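The Hessenberg-to-triangular reduction named in contribution (2) takes one Givens rotation per subdiagonal entry. A minimal CPU-side sketch of that reduction follows (the GPGPU batching is the paper's contribution and is not shown here):

```python
import math

def hessenberg_qr(H):
    # Reduce an upper Hessenberg matrix to the upper triangular R of its
    # QR factorization: one Givens rotation per subdiagonal entry.
    n = len(H)
    R = [row[:] for row in H]
    for k in range(n - 1):
        a, b = R[k][k], R[k + 1][k]
        r = math.hypot(a, b)
        if r == 0.0:
            continue
        c, s = a / r, b / r               # rotation zeroing b below a
        for j in range(k, n):             # apply to rows k and k+1
            rk, rk1 = R[k][j], R[k + 1][j]
            R[k][j] = c * rk + s * rk1
            R[k + 1][j] = -s * rk + c * rk1
        R[k + 1][k] = 0.0                 # exact zero by construction
    return R
```

Because a Hessenberg matrix has only one nonzero below each diagonal entry, this costs O(n^2) rather than the O(n^3) of a full QR, which is why it is cheap enough to use for protecting R.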
Performance analysis of a cardiac simulation code using IPM
ACM SIGPLAN Symposium on Scala. Pub Date: 2011-11-14. DOI: 10.1145/2133173.2133186
P. Strazdins, M. Hegland
Abstract: This paper details our experiences in performing a detailed performance analysis of a large-scale parallel cardiac simulation using the Chaste software on a Nehalem- and InfiniBand-based cluster. Our methodology achieves good accuracy for relatively modest amounts of cluster time. The use of sections in the Chaste internal profiler, coupled with the IPM tool, enabled detailed insights into the performance and scalability of the application.
For large core counts, our analysis showed that performance was no longer dominated by the linear systems solver. The computationally intensive components scaled well up to 2048 cores, while poorly scaling and highly imbalanced components associated with program output and miscellaneous functions limited scalability.
Citations: 3
Fault tolerant matrix-matrix multiplication: correcting soft errors on-line
ACM SIGPLAN Symposium on Scala. Pub Date: 2011-11-14. DOI: 10.1145/2133173.2133185
Panruo Wu, Chong Ding, Longxiang Chen, Feng Gao, T. Davies, Christer Karlsson, Zizhong Chen
Abstract: Soft errors are one-time events that corrupt the state of a computing system but not its overall functionality. Soft errors normally do not interrupt the execution of the affected program, but the affected computation results can no longer be trusted. A well-known technique to correct soft errors in matrix-matrix multiplication is algorithm-based fault tolerance (ABFT). While ABFT achieves much better efficiency than triple modular redundancy (TMR), a traditional general technique for correcting soft errors, both ABFT and TMR detect errors off-line, after the computation is finished. This paper extends the traditional ABFT technique from off-line to on-line, so that soft errors in matrix-matrix multiplication can be detected in the middle of the computation during program execution, and higher efficiency can be achieved by correcting the corrupted computations in a timely manner. Experimental results demonstrate that the proposed technique can correct one error every ten seconds with negligible (i.e., less than 1%) performance penalty over the ATLAS dgemm().
Citations: 35
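The off-line ABFT baseline the paper extends fits in a few lines (the classic Huang-Abraham checksum encoding; the paper's on-line detection inside the running computation goes beyond this sketch): append a column-sum row to A and a row-sum column to B, and the product of the encoded factors then carries both checksums of C, so a single corrupted entry is located by its intersecting failed row and column tests and repaired.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def encode(A, B):
    # Af gains a column-sum checksum row; Bf gains a row-sum checksum column.
    Af = [row[:] for row in A] + [[sum(col) for col in zip(*A)]]
    Bf = [row[:] + [sum(row)] for row in B]
    return Af, Bf

def correct_single_error(Cf, tol=1e-9):
    # Locate a single bad entry of C = Af*Bf (minus its checksum row/column)
    # by the failing row and column checksum tests, then repair it.
    n, p = len(Cf) - 1, len(Cf[0]) - 1
    bad_row = next((i for i in range(n)
                    if abs(sum(Cf[i][:p]) - Cf[i][p]) > tol), None)
    bad_col = next((j for j in range(p)
                    if abs(sum(Cf[i][j] for i in range(n)) - Cf[n][j]) > tol), None)
    if bad_row is not None and bad_col is not None:
        Cf[bad_row][bad_col] += Cf[bad_row][p] - sum(Cf[bad_row][:p])
    return [row[:p] for row in Cf[:n]]
```

The encoding adds only O(n) work per row and column, which is why ABFT is so much cheaper than running the multiplication three times as TMR does.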
Layout-aware scientific computing: a case study using MILC
ACM SIGPLAN Symposium on Scala. Pub Date: 2011-11-14. DOI: 10.1145/2133173.2133183
Jun He, J. Kowalkowski, M. Paterno, D. Holmgren, J. Simone, Xian-He Sun
Abstract: Nowadays, high performance computers have more cores and nodes than ever before. Computation is spread out among them, leading to more communication. For this reason, communication can easily become the bottleneck of a system and limit its scalability. The layout of an application on a computer is the key factor in preserving communication locality and reducing its cost. In this paper, we propose a simple model to optimize the layout for scientific applications by minimizing inter-node communication cost. The model takes into account the latency and bandwidth of the network and associates them with the dominant layout variables of the application. We take MILC as an example and analyze its communication patterns. According to our experimental results, the model developed for MILC achieved satisfactory accuracy in predicting performance, leading to up to 31% performance improvement.
Citations: 7
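The kind of latency/bandwidth model the abstract describes can be sketched as a standard alpha-beta cost tied to layout variables. The 2-D halo-exchange example and all constants below are illustrative assumptions, not MILC's actual (4-D lattice) communication pattern:

```python
def comm_cost(messages, nbytes, latency, bandwidth):
    # Alpha-beta model: time = latency * #messages + bytes / bandwidth.
    return latency * messages + nbytes / bandwidth

def halo_cost(px, py, nx, ny, latency, bandwidth, word=8):
    # Per-process, per-step cost of a 2-D halo exchange for an nx x ny grid
    # decomposed over a px x py process mesh (two faces per split dimension).
    msgs = 2 * (px > 1) + 2 * (py > 1)
    nbytes = word * (2 * (ny // py) * (px > 1) + 2 * (nx // px) * (py > 1))
    return comm_cost(msgs, nbytes, latency, bandwidth)
```

For 16 processes on a 4096x4096 grid with 1 microsecond latency and 1 GB/s bandwidth, the model prefers a 4x4 block layout over a 16x1 strip: the halved halo surface outweighs the two extra messages, which is exactly the surface-vs-latency trade-off a layout optimizer explores.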
Top down programming methodology and tools with StarSs - enabling scalable programming paradigms: extended abstract
ACM SIGPLAN Symposium on Scala. Pub Date: 2011-11-14. DOI: 10.1145/2133173.2133182
Rosa M. Badia
Abstract: Current supercomputers are evolving into clusters with a very large number of nodes, and moreover, the nodes themselves are becoming more complex, composed of several multicore chips and GPUs. With such architectures, application developers face an increasingly complex task. On the other hand, most HPC applications are scientific legacy codes written in MPI and designed for at most thousands of processors. Current efforts deal with extending these applications to scale to larger numbers of cores and to be combined with CUDA or OpenCL to run efficiently on GPUs.
To evolve a given application to run on new heterogeneous supercomputers, application developers can take different alternatives: optimizations to relieve MPI bottlenecks, for example by using asynchronous communications; optimizations of the sequential code to improve its locality; or optimizations at the node level to avoid resource contention, to list a few.
This paper proposes a methodology to enable current MPI applications to be improved using the MPI/StarSs programming model. StarSs [2] is a task-based programming model that enables the parallelization of sequential applications by annotating the code with compiler directives. More importantly, it supports their execution on heterogeneous platforms, including clusters of GPUs. It also hybridizes nicely with MPI [1] and enables the overlap of communication and computation.
The approach is based on the generation at execution time of a directed acyclic graph (DAG), where the nodes of the graph denote tasks in the application and the edges denote data dependences between tasks. Once a partial DAG has been generated, the StarSs runtime is able to schedule the tasks to the different cores or GPUs of the platform.
Another relevant aspect is that the programming model offers application developers a single name space while the actual memory addresses can be distributed (as in a cluster or a node with GPUs). The StarSs runtime maintains a hierarchical directory with information about where to find each block of data, and different software caches are maintained in each of the distributed memory spaces. The runtime is responsible for transferring data between the different memory spaces and for keeping them coherent.
While the programming model itself comes with a very simple syntax, identifying tasks may sometimes not be as easy as one might predict, especially when trying to taskify MPI applications. With the purpose of simplifying this process, a set of tools has been developed around the framework: Ssgrind, which helps identify tasks and the directionality of the tasks' parameters; Ayudame and Temanejo, which help debug StarSs applications; and Paraver, Cube and Scalasca, which enable a detailed performance analysis of the applications. The extended version of the paper will detail the programming methodology outlined here, illustrating it with examples.
Citations: 4
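The annotation-to-DAG idea can be sketched as a toy runtime: tasks declare the named data they read and write, true (read-after-write) dependences become DAG edges, and a task launches once the futures it depends on have resolved. This Python sketch only mimics the style; it is not the StarSs compiler or runtime, and it handles only true dependences (no renaming for anti- or output dependences, no distributed memory).

```python
from concurrent.futures import ThreadPoolExecutor

class TaskGraph:
    # Toy StarSs-style runtime: DAG edges are inferred from the read/write
    # sets each task declares for named data blocks.
    def __init__(self, workers=4):
        self.pool = ThreadPoolExecutor(max_workers=workers)
        self.writer = {}                     # data name -> future of last writer

    def task(self, fn, reads=(), writes=()):
        deps = [self.writer[r] for r in reads if r in self.writer]
        def run():
            for d in deps:                   # block until every input is produced
                d.result()
            return fn()
        fut = self.pool.submit(run)
        for name in writes:
            self.writer[name] = fut          # later readers depend on this task
        return fut

def demo_dag():
    store, g = {}, TaskGraph()
    g.task(lambda: store.update(a=1), writes=["a"])   # independent producers
    g.task(lambda: store.update(b=2), writes=["b"])
    last = g.task(lambda: store.update(c=store["a"] + store["b"]),
                  reads=["a", "b"], writes=["c"])     # consumer waits on both
    last.result()
    g.pool.shutdown()
    return store["c"]
```

The two producer tasks run concurrently and the consumer fires only when both have finished, which is the partial-DAG scheduling behaviour the abstract describes.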