{"title":"基于局部域分解的非线性k波仿真性能与精度分析","authors":"B. Treeby, Filip Vaverka, J. Jaros","doi":"10.1121/2.0000883","DOIUrl":null,"url":null,"abstract":"Large-scale nonlinear ultrasound simulations using the open-source k-Wave toolbox are now routinely performed using the MPI version of k-Wave running on traditional CPU-based clusters. However, the all-to-all communications required by the 3D fast Fourier transform (FFT) severely impact performance when scaling to large numbers of compute cores. This can be overcome by using a domain decomposition strategy based on a local Fourier basis. In this work, we analyze the performance and accuracy of using local domain decomposition for running a high-intensity focused ultrasound (HIFU) simulation in the kidney on a single server containing eight NVIDIA P40 graphical processing units (GPUs). Different decompositions and overlap sizes are investigated and compared to a global MPI simulation running on a CPU-based supercomputer using 1280 cores. For a grid size of 960 by 960 by 1280 grid points and an overlap size of 4 grid points, the error in the simulation using local domain decomposition is on the order of 0.1$ compared to the global simulation, which is sufficient for most applications. The financial cost for running the simulation is also reduced by more than an order of magnitude.Large-scale nonlinear ultrasound simulations using the open-source k-Wave toolbox are now routinely performed using the MPI version of k-Wave running on traditional CPU-based clusters. However, the all-to-all communications required by the 3D fast Fourier transform (FFT) severely impact performance when scaling to large numbers of compute cores. This can be overcome by using a domain decomposition strategy based on a local Fourier basis. In this work, we analyze the performance and accuracy of using local domain decomposition for running a high-intensity focused ultrasound (HIFU) simulation in the kidney on a single server containing eight NVIDIA P40 graphical processing units (GPUs). Different decompositions and overlap sizes are investigated and compared to a global MPI simulation running on a CPU-based supercomputer using 1280 cores. For a grid size of 960 by 960 by 1280 grid points and an overlap size of 4 grid points, the error in the simulation using local domain decomposition is on the order of 0.1...","PeriodicalId":20469,"journal":{"name":"Proc. Meet. Acoust.","volume":"29 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2018-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Performance and accuracy analysis of nonlinear k-Wave simulations using local domain decomposition with an 8-GPU server\",\"authors\":\"B. Treeby, Filip Vaverka, J. Jaros\",\"doi\":\"10.1121/2.0000883\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Large-scale nonlinear ultrasound simulations using the open-source k-Wave toolbox are now routinely performed using the MPI version of k-Wave running on traditional CPU-based clusters. However, the all-to-all communications required by the 3D fast Fourier transform (FFT) severely impact performance when scaling to large numbers of compute cores. This can be overcome by using a domain decomposition strategy based on a local Fourier basis. In this work, we analyze the performance and accuracy of using local domain decomposition for running a high-intensity focused ultrasound (HIFU) simulation in the kidney on a single server containing eight NVIDIA P40 graphical processing units (GPUs). Different decompositions and overlap sizes are investigated and compared to a global MPI simulation running on a CPU-based supercomputer using 1280 cores. For a grid size of 960 by 960 by 1280 grid points and an overlap size of 4 grid points, the error in the simulation using local domain decomposition is on the order of 0.1$ compared to the global simulation, which is sufficient for most applications. The financial cost for running the simulation is also reduced by more than an order of magnitude.Large-scale nonlinear ultrasound simulations using the open-source k-Wave toolbox are now routinely performed using the MPI version of k-Wave running on traditional CPU-based clusters. However, the all-to-all communications required by the 3D fast Fourier transform (FFT) severely impact performance when scaling to large numbers of compute cores. This can be overcome by using a domain decomposition strategy based on a local Fourier basis. In this work, we analyze the performance and accuracy of using local domain decomposition for running a high-intensity focused ultrasound (HIFU) simulation in the kidney on a single server containing eight NVIDIA P40 graphical processing units (GPUs). Different decompositions and overlap sizes are investigated and compared to a global MPI simulation running on a CPU-based supercomputer using 1280 cores. For a grid size of 960 by 960 by 1280 grid points and an overlap size of 4 grid points, the error in the simulation using local domain decomposition is on the order of 0.1...\",\"PeriodicalId\":20469,\"journal\":{\"name\":\"Proc. Meet. Acoust.\",\"volume\":\"29 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-10-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proc. Meet. Acoust.\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1121/2.0000883\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proc. Meet. Acoust.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1121/2.0000883","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Performance and accuracy analysis of nonlinear k-Wave simulations using local domain decomposition with an 8-GPU server
Large-scale nonlinear ultrasound simulations using the open-source k-Wave toolbox are now routinely performed using the MPI version of k-Wave running on traditional CPU-based clusters. However, the all-to-all communications required by the 3D fast Fourier transform (FFT) severely impact performance when scaling to large numbers of compute cores. This can be overcome by using a domain decomposition strategy based on a local Fourier basis. In this work, we analyze the performance and accuracy of using local domain decomposition for running a high-intensity focused ultrasound (HIFU) simulation in the kidney on a single server containing eight NVIDIA P40 graphical processing units (GPUs). Different decompositions and overlap sizes are investigated and compared to a global MPI simulation running on a CPU-based supercomputer using 1280 cores. For a grid size of 960 by 960 by 1280 grid points and an overlap size of 4 grid points, the error in the simulation using local domain decomposition is on the order of 0.1$ compared to the global simulation, which is sufficient for most applications. The financial cost for running the simulation is also reduced by more than an order of magnitude.Large-scale nonlinear ultrasound simulations using the open-source k-Wave toolbox are now routinely performed using the MPI version of k-Wave running on traditional CPU-based clusters. However, the all-to-all communications required by the 3D fast Fourier transform (FFT) severely impact performance when scaling to large numbers of compute cores. This can be overcome by using a domain decomposition strategy based on a local Fourier basis. In this work, we analyze the performance and accuracy of using local domain decomposition for running a high-intensity focused ultrasound (HIFU) simulation in the kidney on a single server containing eight NVIDIA P40 graphical processing units (GPUs). Different decompositions and overlap sizes are investigated and compared to a global MPI simulation running on a CPU-based supercomputer using 1280 cores. For a grid size of 960 by 960 by 1280 grid points and an overlap size of 4 grid points, the error in the simulation using local domain decomposition is on the order of 0.1...