Using HPX and LibGeoDecomp for scaling HPC applications on heterogeneous supercomputers

T. Heller, Hartmut Kaiser, Andreas Schäfer, D. Fey
DOI: 10.1145/2530268.2530269 · Published: 2013-11-17 · Venue: ACM SIGPLAN Symposium on Scala
Citations: 37

Abstract

With the general availability of PetaFLOP clusters and the advent of heterogeneous machines equipped with special accelerator cards such as the Xeon Phi[2], computer scientists face the difficult task of improving application scalability beyond what is possible with conventional techniques and programming models today. In addition, the need for highly adaptive runtime algorithms, and for applications that handle highly inhomogeneous data, further impedes our ability to efficiently write code which performs and scales well.

In this paper we present the advantages of using HPX[19, 3, 29], a general-purpose parallel runtime system for applications of any scale, as a backend for LibGeoDecomp[25] to implement a three-dimensional N-body simulation with local interactions. We compare scaling and performance results for this application when using the HPX and MPI backends for LibGeoDecomp. LibGeoDecomp is a library for geometric decomposition codes implementing the idea of a user-supplied simulation model, where the library handles the spatial and temporal loops as well as the data storage.

The presented results are acquired from various homogeneous and heterogeneous runs including up to 1024 nodes (16384 conventional cores) combined with up to 16 Xeon Phi accelerators (3856 hardware threads) on TACC's Stampede supercomputer[1]. In the configuration using the HPX backend, more than 0.35 PFLOPS were achieved, which corresponds to a parallel application efficiency of around 79%. Our measurements demonstrate the advantage of the intrinsically asynchronous, message-driven programming model exposed by HPX, which enables better latency hiding, fine- to medium-grain parallelism, and constraint-based synchronization. HPX's uniform programming model simplifies writing highly parallel code for heterogeneous resources.