Scalable fast multipole methods on distributed heterogeneous architectures

2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC) Pub Date : 2011-11-12 DOI:10.1145/2063384.2063432

Qi Hu, N. Gumerov, R. Duraiswami

{"title":"Scalable fast multipole methods on distributed heterogeneous architectures","authors":"Qi Hu, N. Gumerov, R. Duraiswami","doi":"10.1145/2063384.2063432","DOIUrl":null,"url":null,"abstract":"We fundamentally reconsider implementation of the Fast Multipole Method (FMM) on a computing node with a heterogeneous CPU-GPU architecture with multicore CPU(s) and one or more GPU accelerators, as well as on an interconnected cluster of such nodes. The FMM is a divide-and-conquer algorithm that performs a fast N-body sum using a spatial decomposition and is often used in a time-stepping or iterative loop. Using the observation that the local summation and the analysis-based translation parts of the FMM are independent, we map these respectively to the GPUs and CPUs. Careful analysis of the FMM is performed to distribute work optimally between the multicore CPUs and the GPU accelerators. We first develop a single node version where the CPU part is parallelized using OpenMP and the GPU version via CUDA. New parallel algorithms for creating FMM data structures are presented together with load balancing strategies for the single node and distributed multiple-node versions. Our implementation can perform the N-body sum for 128M particles on 16 nodes in 4.23 seconds, a performance not achieved by others in the literature on such clusters.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"137 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"48","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2063384.2063432","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 48

Abstract

We fundamentally reconsider implementation of the Fast Multipole Method (FMM) on a computing node with a heterogeneous CPU-GPU architecture with multicore CPU(s) and one or more GPU accelerators, as well as on an interconnected cluster of such nodes. The FMM is a divide-and-conquer algorithm that performs a fast N-body sum using a spatial decomposition and is often used in a time-stepping or iterative loop. Using the observation that the local summation and the analysis-based translation parts of the FMM are independent, we map these respectively to the GPUs and CPUs. Careful analysis of the FMM is performed to distribute work optimally between the multicore CPUs and the GPU accelerators. We first develop a single node version where the CPU part is parallelized using OpenMP and the GPU version via CUDA. New parallel algorithms for creating FMM data structures are presented together with load balancing strategies for the single node and distributed multiple-node versions. Our implementation can perform the N-body sum for 128M particles on 16 nodes in 4.23 seconds, a performance not achieved by others in the literature on such clusters.

查看原文本刊更多论文

分布式异构体系结构的可扩展快速多极方法

我们从根本上重新考虑快速多极方法(FMM)在计算节点上的实现，该计算节点具有异构CPU-GPU架构，具有多核CPU(s)和一个或多个GPU加速器，以及在此类节点的互连集群上。FMM是一种分治算法，它使用空间分解执行快速的n体求和，通常用于时间步进或迭代循环。观察到FMM的局部求和和基于分析的平移部分是独立的，我们将它们分别映射到gpu和cpu上。对FMM进行了仔细的分析，以便在多核cpu和GPU加速器之间最佳地分配工作。我们首先开发了一个单节点版本，其中CPU部分使用OpenMP并行化，GPU版本通过CUDA并行化。提出了用于创建FMM数据结构的新的并行算法以及单节点和分布式多节点版本的负载均衡策略。我们的实现可以在4.23秒内对16个节点上的128M个粒子进行n体求和，这是其他文献中没有达到的性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)

自引率

0.00%

发文量