Performance and accuracy analysis of nonlinear k-Wave simulations using local domain decomposition with an 8-GPU server

B. Treeby, Filip Vaverka, J. Jaros
Proc. Meet. Acoust. (Journal Article), published 2018-10-22. DOI: 10.1121/2.0000883 (https://doi.org/10.1121/2.0000883)
Citations: 4

Abstract

Large-scale nonlinear ultrasound simulations using the open-source k-Wave toolbox are now routinely performed with the MPI version of k-Wave running on traditional CPU-based clusters. However, the all-to-all communications required by the 3D fast Fourier transform (FFT) severely impact performance when scaling to large numbers of compute cores. This can be overcome by using a domain decomposition strategy based on a local Fourier basis. In this work, we analyze the performance and accuracy of using local domain decomposition for running a high-intensity focused ultrasound (HIFU) simulation in the kidney on a single server containing eight NVIDIA P40 graphical processing units (GPUs). Different decompositions and overlap sizes are investigated and compared to a global MPI simulation running on a CPU-based supercomputer using 1280 cores. For a grid size of 960 by 960 by 1280 grid points and an overlap size of 4 grid points, the error in the simulation using local domain decomposition is on the order of 0.1% compared to the global simulation, which is sufficient for most applications. The financial cost of running the simulation is also reduced by more than an order of magnitude.
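
As a rough illustration of the decompositions and overlap sizes mentioned in the abstract, the following Python sketch enumerates ways to split the 960 by 960 by 1280 grid across eight devices and estimates how many extra grid points each subdomain carries. It is a minimal sketch under stated assumptions, not the k-Wave implementation: the function names are hypothetical, the grid is assumed to divide evenly, and the overlap is assumed to be added on both sides of every axis that is split.

```python
from itertools import product
from math import prod


def local_extent(n, parts, overlap):
    """Grid points per subdomain along one axis, including overlap.

    Assumption (not from the paper): an axis split into more than one
    part gains `overlap` points on each side; an unsplit axis gains none.
    """
    if n % parts != 0:
        raise ValueError(f"{n} grid points do not divide evenly into {parts} parts")
    core = n // parts
    halo = 2 * overlap if parts > 1 else 0
    return core + halo


def decompositions(num_devices):
    """All ordered (px, py, pz) splits whose product equals num_devices."""
    candidates = product(range(1, num_devices + 1), repeat=3)
    return [split for split in candidates if prod(split) == num_devices]


def summarize(grid, parts, overlap):
    """Return the local subdomain shape and its size relative to no overlap."""
    local = tuple(local_extent(n, p, overlap) for n, p in zip(grid, parts))
    core_points = prod(n // p for n, p in zip(grid, parts))
    return local, prod(local) / core_points


if __name__ == "__main__":
    grid = (960, 960, 1280)  # grid size quoted in the abstract
    overlap = 4              # overlap size quoted in the abstract
    for parts in decompositions(8):
        local, ratio = summarize(grid, parts, overlap)
        print(f"split {parts}: local grid {local}, {100 * (ratio - 1):.1f}% extra points")
```

Under these assumptions, a 1 by 1 by 8 split with an overlap of 4 grid points gives local subdomains of 960 by 960 by 168 points, about 5% more points than the same split without overlap.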