A heterogeneous processing-in-memory approach to accelerate quantum chemistry simulation

IF 2.1, Zone 4 (Computer Science), Q2 COMPUTER SCIENCE, THEORY & METHODS

Parallel Computing Pub Date : 2023-07-01 DOI:10.1016/j.parco.2023.103017

Zeshi Liu, Zhen Xie, Wenqian Dong, Mengting Yuan, Haihang You, Dong Li

Parallel Computing, Volume 116, Article 103017 (Journal Article). https://www.sciencedirect.com/science/article/pii/S0167819123000236

Citations: 1

Abstract

The “memory wall” is an architectural property that introduces high memory access latency and can limit application performance, and this wall becomes even taller in the context of big data. Although GPU-based systems can achieve high performance, the “memory wall” makes it difficult to improve their utilization. Intensive data exchange and computation remain a challenge for applications with a massive memory footprint. Quantum-mechanics-based ab initio calculations, which leverage high-performance computing to investigate multi-electron systems, are widely used in computational chemistry. However, ab initio calculations are computationally demanding and can easily consume hundreds of gigabytes of memory. Previous efforts on heterogeneous acceleration with GPUs and CPUs suffer from high-latency off-device memory access. In this paper, we introduce heterogeneous processing-in-memory (PIM) to mitigate the overhead of data movement between CPUs and GPUs, and analyze in depth two of the most memory-intensive parts of quantum chemistry simulation: the FFT and time-consuming loops. Specifically, we exploit runtime systems and programming models to improve hardware utilization and simplify programming by moving computation close to the data and eliminating hardware idling. We run our experiments on a widely used software package, QUANTUM ESPRESSO (opEn-Source Package for Research in Electronic Structure, Simulation, and Optimization), and our results show that our design provides up to $4.09\times$ and $2.60\times$ performance improvement and 71% and 88% energy reduction over CPU and GPU (NVIDIA P100), respectively.
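The abstract singles out the FFT as memory-intensive. A rough roofline-style estimate, sketched below, illustrates why such a kernel tends to be limited by memory bandwidth rather than arithmetic. This is not the paper's analysis: the 5·n·log2(n) flop count is the conventional radix-2 estimate, the traffic model (each element read once and written once) is a best case, and the peak and bandwidth figures are approximate public numbers for a P100-class GPU, used here purely as assumptions.

```python
import math


def fft_arithmetic_intensity(n: int, bytes_per_element: int = 16) -> float:
    """Best-case flops per byte for one complex FFT of length n.

    Assumptions (not from the paper): about 5 * n * log2(n) flops, and
    each complex double (16 bytes) is read once and written once.
    """
    flops = 5.0 * n * math.log2(n)
    bytes_moved = 2.0 * n * bytes_per_element
    return flops / bytes_moved


def attainable_gflops(intensity: float, peak_gflops: float,
                      bandwidth_gbs: float) -> float:
    """Roofline bound: the lower of compute peak and bandwidth * intensity."""
    return min(peak_gflops, intensity * bandwidth_gbs)


def is_memory_bound(intensity: float, peak_gflops: float,
                    bandwidth_gbs: float) -> bool:
    """True when the bandwidth roof lies below the compute roof."""
    return intensity * bandwidth_gbs < peak_gflops


if __name__ == "__main__":
    # Assumed, approximate figures for a P100-class GPU (illustrative only).
    PEAK_GFLOPS = 4700.0    # double-precision peak
    BANDWIDTH_GBS = 732.0   # on-device memory bandwidth

    for exponent in (10, 20, 30):
        n = 2 ** exponent
        ai = fft_arithmetic_intensity(n)
        bound = attainable_gflops(ai, PEAK_GFLOPS, BANDWIDTH_GBS)
        label = "memory" if is_memory_bound(ai, PEAK_GFLOPS, BANDWIDTH_GBS) else "compute"
        print(f"n = 2^{exponent}: {ai:.3f} flop/byte, "
              f"bound ~ {bound:.0f} GFLOP/s ({label}-bound)")
```

Under these assumptions even the best-case intensity stays below the ratio of peak compute to bandwidth for all three sizes, so the kernel sits on the bandwidth side of the roofline. Real multi-pass FFTs, and any transfer between CPU and GPU memory, move more data than this model counts, which lowers the intensity further.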

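Source journal

Parallel Computing (Engineering & Technology: Computer Science, Theory & Methods)

CiteScore: 3.50

Self-citation rate: 7.10%

Articles published:

Review time: 4.5 months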

Journal description: Parallel Computing is an international journal presenting the practical use of parallel computer systems, including high performance architecture, system software, programming systems and tools, and applications. Within this context the journal covers all aspects of high-end parallel computing from single homogeneous or heterogeneous computing nodes to large-scale multi-node systems. Parallel Computing features original research work and review articles as well as novel or illustrative accounts of application experience with (and techniques for) the use of parallel computers. We also welcome studies reproducing prior publications that either confirm or disprove prior published results. Particular technical areas of interest include, but are not limited to:

- System software for parallel computer systems including programming languages (new languages as well as compilation techniques), operating systems (including middleware), and resource management (scheduling and load-balancing).
- Enabling software including debuggers, performance tools, and system and numeric libraries.
- General hardware (architecture) concepts, new technologies enabling the realization of such new concepts, and details of commercially available systems.
- Software engineering and productivity as it relates to parallel computing.
- Applications (including scientific computing, deep learning, machine learning) or tool case studies demonstrating novel ways to achieve parallelism.
- Performance measurement results on state-of-the-art systems.
- Approaches to effectively utilize large-scale parallel computing including new algorithms or algorithm analysis with demonstrated relevance to real applications using existing or next generation parallel computer architectures.
- Parallel I/O systems, both hardware and software.
- Networking technology for support of high-speed computing demonstrating the impact of high-speed computation on parallel applications.