Performance and accuracy analysis of nonlinear k-Wave simulations using local domain decomposition with an 8-GPU server

Proc. Meet. Acoust. Pub Date : 2018-10-22 DOI:10.1121/2.0000883

B. Treeby, Filip Vaverka, J. Jaros


Citations: 4

Abstract

Large-scale nonlinear ultrasound simulations using the open-source k-Wave toolbox are now routinely performed using the MPI version of k-Wave running on traditional CPU-based clusters. However, the all-to-all communications required by the 3D fast Fourier transform (FFT) severely impact performance when scaling to large numbers of compute cores. This can be overcome by using a domain decomposition strategy based on a local Fourier basis. In this work, we analyze the performance and accuracy of using local domain decomposition for running a high-intensity focused ultrasound (HIFU) simulation in the kidney on a single server containing eight NVIDIA P40 graphics processing units (GPUs). Different decompositions and overlap sizes are investigated and compared to a global MPI simulation running on a CPU-based supercomputer using 1280 cores. For a grid size of 960 by 960 by 1280 grid points and an overlap size of 4 grid points, the error in the simulation using local domain decomposition is on the order of 0.1% compared to the global simulation, which is sufficient for most applications. The financial cost for running the simulation is also reduced by more than an order of magnitude.
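To make the idea of a local Fourier basis concrete, the following is a minimal 1D sketch and not the k-Wave implementation. It assumes a periodic global grid, a hypothetical error-function bell window (`bell_window`, with an assumed `steepness` parameter), and an illustrative overlap of 16 points rather than the 4 points reported in the abstract. Each subdomain is extended by an overlap region taken from its neighbours, multiplied by the window so that the local data become approximately periodic, differentiated with a local FFT, and then trimmed back to its interior. Only neighbour-to-neighbour exchange of the overlap is needed, instead of the all-to-all communication of a global FFT.

```python
import math
import numpy as np


def bell_window(n_interior, overlap, steepness=3.0):
    """Window equal to 1 on the interior, tapering towards 0 across each overlap.

    The erf-based taper and the steepness value are assumptions for this sketch.
    """
    t = np.arange(overlap) / max(overlap, 1)
    ramp = np.array([0.5 * (1.0 + math.erf(steepness * (2.0 * v - 1.0))) for v in t])
    return np.concatenate([ramp, np.ones(n_interior), ramp[::-1]])


def local_fourier_derivative(f, dx, n_sub, overlap):
    """First derivative of a periodic 1D signal using per-subdomain FFTs."""
    n = f.size
    assert n % n_sub == 0, "grid must divide evenly into subdomains"
    m = n // n_sub
    window = bell_window(m, overlap)
    # Angular wavenumbers for the extended local grid (interior plus two overlaps).
    k = 2.0 * np.pi * np.fft.fftfreq(m + 2 * overlap, d=dx)
    out = np.empty_like(f)
    for s in range(n_sub):
        # Gather interior plus overlap points; modulo stands in for neighbour exchange.
        idx = np.arange(s * m - overlap, (s + 1) * m + overlap) % n
        local = f[idx] * window
        d_local = np.fft.ifft(1j * k * np.fft.fft(local)).real
        # Keep only the interior, where the window is 1 and its slope is 0.
        out[s * m:(s + 1) * m] = d_local[overlap:overlap + m]
    return out


n = 256
dx = 1.0 / n
x = np.arange(n) * dx
f = np.sin(2 * np.pi * 3 * x)
exact = 2 * np.pi * 3 * np.cos(2 * np.pi * 3 * x)


def rel_error(n_sub, overlap):
    """Maximum error relative to the peak of the exact derivative."""
    approx = local_fourier_derivative(f, dx, n_sub, overlap)
    return np.max(np.abs(approx - exact)) / np.max(np.abs(exact))


err_global = rel_error(1, 0)       # single global FFT
err_no_overlap = rel_error(4, 0)   # four subdomains, no overlap
err_overlap = rel_error(4, 16)     # four subdomains, 16-point overlap
print(err_global, err_no_overlap, err_overlap)
```

In this toy setting the decomposition without overlap is badly wrong because each local chunk is not periodic, while adding the windowed overlap brings the result close to the global FFT. The accuracy depends on the window shape and overlap size, which is the trade-off the abstract investigates for the full 3D simulation.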
