Kanta Suzuki, Yasuaki Ito, Haruto Fujii, Nobuya Yokogawa, Satoki Tsuji, Koji Nakano, Victor Parque, Akihiko Kasagi
Concurrency and Computation: Practice and Experience, vol. 37, no. 25-26. Published 2025-10-09. DOI: 10.1002/cpe.70328. https://onlinelibrary.wiley.com/doi/10.1002/cpe.70328
Efficient GPU Implementations of Three-Center Two-Electron Repulsion Integrals
In computational quantum chemistry, computing three-center two-electron repulsion integrals (three-center ERIs) is essential for density fitting. Because of the large number of integral elements and the resulting combinatorial computational complexity, the community has actively pursued faster ERI calculations to reach practical levels of efficiency. In GPU implementations, atomicAdd is known to incur significant memory overhead: frequent collisions and retries during value aggregation in global GPU memory lead to substantial performance degradation. To tackle this issue, we propose new thread mapping strategies for three-center two-electron integrals on GPUs that aim to reduce the cost of value aggregation. Our methods replace device-level reduction (atomicAdd), where suitable, with efficient warp- and thread-level reduction, such as warp shuffle and register accumulation. Computational experiments using an Intel Xeon Gold 6338 CPU, an NVIDIA A100 GPU, and relevant molecules of interest show that the proposed strategies outperform the conventional thread mapping scheme, computing three-center ERIs up to 2.76× faster. Moreover, compared to well-known quantum chemistry software, our method achieved speedups of up to 11.90× over PySCF and up to 4.99× over GPU4PySCF. Our method has the potential to further enhance the performance, extensibility, and versatility of GPU-accelerated quantum chemical computations.
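The idea of replacing per-value atomicAdd calls with register accumulation and a warp-level reduction can be illustrated with a small host-side simulation. The sketch below is not the authors' implementation and is not GPU code: it is plain Python that mimics the data movement of a shuffle-down tree reduction across 32 lanes and counts how many global atomic updates each scheme would issue. All function names (`shfl_down`, `warp_reduce_sum`, `aggregate_naive`, `aggregate_hierarchical`) and the `per_thread` parameter are hypothetical, chosen only for this illustration.

```python
WARP_SIZE = 32  # number of lanes in one warp on NVIDIA GPUs


def shfl_down(values, offset):
    """Mimic a shuffle-down: lane i reads the value held by lane i + offset.

    Lanes whose source lane would fall outside the warp keep their own value.
    """
    n = len(values)
    return [values[i + offset] if i + offset < n else values[i] for i in range(n)]


def warp_reduce_sum(values):
    """Tree reduction over one warp; lane 0 ends up holding the total."""
    assert len(values) == WARP_SIZE
    vals = list(values)
    offset = WARP_SIZE // 2
    while offset > 0:
        shifted = shfl_down(vals, offset)
        vals = [a + b for a, b in zip(vals, shifted)]
        offset //= 2
    return vals[0]


def aggregate_naive(contribs):
    """Baseline: every contribution gets its own atomic update."""
    total, atomic_ops = 0.0, 0
    for c in contribs:
        total += c  # stands in for one atomicAdd on global memory
        atomic_ops += 1
    return total, atomic_ops


def aggregate_hierarchical(contribs, per_thread):
    """Register accumulation, then warp reduction, then one atomic per warp."""
    chunk = WARP_SIZE * per_thread
    assert len(contribs) % chunk == 0
    total, atomic_ops = 0.0, 0
    for base in range(0, len(contribs), chunk):
        warp = contribs[base:base + chunk]
        # Each lane sums its own strided elements in a private "register".
        regs = [sum(warp[lane::WARP_SIZE]) for lane in range(WARP_SIZE)]
        total += warp_reduce_sum(regs)  # only lane 0 would issue the atomic
        atomic_ops += 1
    return total, atomic_ops


if __name__ == "__main__":
    data = [float(i) for i in range(256)]
    print(aggregate_naive(data))
    print(aggregate_hierarchical(data, per_thread=4))
```

In this toy setting both schemes produce the same sum, but the hierarchical one issues one simulated atomic update per warp instead of one per contribution. On real hardware the benefit depends on how often threads collide on the same output address, which this simulation does not model.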
About the journal:
Concurrency and Computation: Practice and Experience (CCPE) publishes high-quality original research papers and authoritative research review papers in the overlapping fields of:
Parallel and distributed computing;
High-performance computing;
Computational and data science;
Artificial intelligence and machine learning;
Big data applications, algorithms, and systems;
Network science;
Ontologies and semantics;
Security and privacy;
Cloud/edge/fog computing;
Green computing; and
Quantum computing.