An Efficient RI-MP2 Algorithm for Distributed Many-GPU Architectures

IF 5.7 · Zone 1 (Chemistry) · JCR Q2 CHEMISTRY, PHYSICAL

Journal of Chemical Theory and Computation Pub Date : 2024-10-18 DOI:10.1021/acs.jctc.4c00814

Calum Snowdon, Giuseppe M. J. Barca


Citations: 0

Abstract


An Efficient RI-MP2 Algorithm for Distributed Many-GPU Architectures


Second-order Møller–Plesset perturbation theory (MP2) using the Resolution of the Identity approximation (RI-MP2) is a widely used method for computing molecular energies beyond the Hartree–Fock mean-field approximation. However, its high computational cost and lack of efficient algorithms for modern supercomputing architectures limit its applicability to large molecules. In this paper, we present the first distributed-memory many-GPU RI-MP2 algorithm explicitly designed to utilize hundreds of GPU accelerators for every step of the computation. Our novel algorithm achieves near-peak performance on GPU-based supercomputers through the development of a distributed memory algorithm for forming RI-MP2 intermediate tensors with zero internode communication, except for a single $\mathcal{O}(N^{2})$ asynchronous broadcast, and a distributed memory algorithm for the $\mathcal{O}(N^{5})$ energy reduction step, capable of sustaining near-peak performance on clusters with several hundred GPUs. Comparative analysis shows our implementation outperforms state-of-the-art quantum chemistry software by over 3.5 times in speed while achieving an 8-fold reduction in computational power consumption. Benchmarking on the Perlmutter supercomputer, our algorithm achieves 11.8 PFLOP/s (83% of peak performance) performing the RI-MP2 energy calculation on a 314-water cluster with 7850 primary and 30,144 auxiliary basis functions in 4 min on 180 nodes and 720 A100 GPUs. This performance represents a substantial improvement over traditional CPU-based methods, demonstrating significant time-to-solution and power consumption benefits of leveraging modern GPU-accelerated computing environments for quantum chemistry calculations.
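To make the $\mathcal{O}(N^{5})$ energy step concrete, here is a minimal single-process NumPy sketch of the standard closed-shell RI-MP2 correlation energy. It is not the authors' distributed many-GPU algorithm, whose details the abstract does not give. The function name `ri_mp2_energy`, the tensor name `B` (three-index RI intermediates with shape occupied × virtual × auxiliary), and the random inputs are all assumptions made for illustration; real inputs would come from a Hartree–Fock calculation.

```python
import numpy as np


def ri_mp2_energy(B, eps_occ, eps_vir):
    """Closed-shell RI-MP2 correlation energy (illustrative sketch).

    B       : (n_occ, n_vir, n_aux) RI intermediates, so that
              (ia|jb) ~= sum_P B[i, a, P] * B[j, b, P]
    eps_occ : (n_occ,) occupied orbital energies
    eps_vir : (n_vir,) virtual orbital energies
    """
    n_occ = B.shape[0]
    # vir_sum[a, b] = eps_a + eps_b, reused for every occupied pair
    vir_sum = eps_vir[:, None] + eps_vir[None, :]
    e_corr = 0.0
    for i in range(n_occ):
        for j in range(i + 1):  # pair energies are symmetric in (i, j)
            # One GEMM per pair: V[a, b] = (ia|jb), cost n_vir^2 * n_aux
            V = B[i] @ B[j].T
            denom = eps_occ[i] + eps_occ[j] - vir_sum
            # V.T[a, b] = (ib|ja), the exchange-like term
            pair = np.sum(V * (2.0 * V - V.T) / denom)
            e_corr += pair if i == j else 2.0 * pair
    return e_corr


# Synthetic inputs (assumption: random data, not a real molecule)
rng = np.random.default_rng(0)
n_occ, n_vir, n_aux = 4, 6, 20
B = 0.1 * rng.standard_normal((n_occ, n_vir, n_aux))
eps_occ = -rng.uniform(0.5, 2.0, n_occ)  # occupied energies below zero
eps_vir = rng.uniform(0.2, 2.0, n_vir)   # virtual energies above zero
print(ri_mp2_energy(B, eps_occ, eps_vir))
```

The loop performs on the order of n_occ² matrix products, each costing n_vir² × n_aux operations, which is where the fifth-power scaling comes from. Each occupied pair is independent of the others, so the work is dominated by dense matrix multiplications; how the paper partitions that work across GPUs is beyond what the abstract states.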


Source journal

Journal of Chemical Theory and Computation (Chemistry; Physics: Atomic, Molecular and Chemical Physics)

CiteScore: 9.90

Self-citation rate: 16.40%

Articles published: 568

Review time: 1 month

About the journal: The Journal of Chemical Theory and Computation invites new and original contributions with the understanding that, if accepted, they will not be published elsewhere. Papers reporting new theories, methodology, and/or important applications in quantum electronic structure, molecular dynamics, and statistical mechanics are appropriate for submission to this Journal. Specific topics include advances in or applications of ab initio quantum mechanics, density functional theory, design and properties of new materials, surface science, Monte Carlo simulations, solvation models, QM/MM calculations, biomolecular structure prediction, and molecular dynamics in the broadest sense including gas-phase dynamics, ab initio dynamics, biomolecular dynamics, and protein folding. The Journal does not consider papers that are straightforward applications of known methods including DFT and molecular dynamics. The Journal favors submissions that include advances in theory or methodology with applications to compelling problems.