{"title":"高度可移植的基于c++的模拟器,具有双并行性和空间分解的模拟域,使用浮点运算和每瓦特更多的浮点数,以便更好地解决粒子模拟的问题","authors":"Nisha Agrawal, A. Das, R. Pathak, M. Modani","doi":"10.1109/icccs55155.2022.9846280","DOIUrl":null,"url":null,"abstract":"LAMMPS is a classical molecular dynamics (MD) code that models ensembles of particles in a solid, liquid, or gaseous state. LAMMPS performance has been optimized over the years. This paper focuses on the study of MD simulations using LAMMPS on the PARAM Siddhi-AI system. LAMMPS’s performance is analyzed with the number of the latest NVIDIA A100 co-processor (GPU) based on Ampere architecture. The faster and larger L1 cache and shared memory in Ampere architecture (192 KB per Streaming Multiprocessor (SM)) delivers additional speedups for High-Performance Computing (HPC) workloads. In this work, single-node multi-GPUs as well as multi-node multi-GPUs (up to 5 nodes) LAMMPS performance for two input datasets LJ 2.5 (intermolecular pair potential) and EAM (interatomic potential), on PARAM Siddhi-AI system, is discussed. Performance improvement of GPU-enabled LAMMPS run over CPU-only performance is demonstrated for both the inputs data sets. LAMMPS performance is analyzed for initialization, atoms communication, forces, thermodynamic state (Pair (non-bonded force computations), Neigh (neighbor list construction), Comm (inter-processor communication of atoms and their properties), Output (output of thermodynamic info and dump files), Modify (fixes and computes invoked by fixes) and others (all the remaining time forces and functions)). GPU utilization (in terms of computing and memory) is also discussed for both the input datasets. GPU enabled LAMMPS single-node performance shows 31x and 125x speed-up in comparison to single-node CPU-only performance for LJ2.5 and EAM input datasets respectively. LAMMPS scales well across multi-node and shows almost linear scalability. In comparison to single node GPU enabled LAMMPS run, the observed speedup is 4.1x and 3.8x on 5 nodes, for LJ2.5 and EAM input datasets respectively. LAMMPS performance comparison across GPU generation shows 1.5x to 1.9x speedup on A100 GPUs over V100 GPUs.","PeriodicalId":121713,"journal":{"name":"2022 7th International Conference on Computer and Communication Systems (ICCCS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Highly Portable C++ Based Simulator with Dual Parallelism and Spatial Decomposition of Simulation Domain using Floating Point Operations and More Flops Per Watt for Better Time-To-Solution on Particle Simulation\",\"authors\":\"Nisha Agrawal, A. Das, R. Pathak, M. Modani\",\"doi\":\"10.1109/icccs55155.2022.9846280\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"LAMMPS is a classical molecular dynamics (MD) code that models ensembles of particles in a solid, liquid, or gaseous state. LAMMPS performance has been optimized over the years. This paper focuses on the study of MD simulations using LAMMPS on the PARAM Siddhi-AI system. LAMMPS’s performance is analyzed with the number of the latest NVIDIA A100 co-processor (GPU) based on Ampere architecture. The faster and larger L1 cache and shared memory in Ampere architecture (192 KB per Streaming Multiprocessor (SM)) delivers additional speedups for High-Performance Computing (HPC) workloads. 
In this work, single-node multi-GPUs as well as multi-node multi-GPUs (up to 5 nodes) LAMMPS performance for two input datasets LJ 2.5 (intermolecular pair potential) and EAM (interatomic potential), on PARAM Siddhi-AI system, is discussed. Performance improvement of GPU-enabled LAMMPS run over CPU-only performance is demonstrated for both the inputs data sets. LAMMPS performance is analyzed for initialization, atoms communication, forces, thermodynamic state (Pair (non-bonded force computations), Neigh (neighbor list construction), Comm (inter-processor communication of atoms and their properties), Output (output of thermodynamic info and dump files), Modify (fixes and computes invoked by fixes) and others (all the remaining time forces and functions)). GPU utilization (in terms of computing and memory) is also discussed for both the input datasets. GPU enabled LAMMPS single-node performance shows 31x and 125x speed-up in comparison to single-node CPU-only performance for LJ2.5 and EAM input datasets respectively. LAMMPS scales well across multi-node and shows almost linear scalability. In comparison to single node GPU enabled LAMMPS run, the observed speedup is 4.1x and 3.8x on 5 nodes, for LJ2.5 and EAM input datasets respectively. LAMMPS performance comparison across GPU generation shows 1.5x to 1.9x speedup on A100 GPUs over V100 GPUs.\",\"PeriodicalId\":121713,\"journal\":{\"name\":\"2022 7th International Conference on Computer and Communication Systems (ICCCS)\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-04-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 7th International Conference on Computer and Communication Systems (ICCCS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/icccs55155.2022.9846280\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 7th International Conference on Computer and Communication Systems (ICCCS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/icccs55155.2022.9846280","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Highly Portable C++ Based Simulator with Dual Parallelism and Spatial Decomposition of Simulation Domain using Floating Point Operations and More Flops Per Watt for Better Time-To-Solution on Particle Simulation
LAMMPS is a classical molecular dynamics (MD) code that models ensembles of particles in the solid, liquid, or gaseous state, and its performance has been optimized over the years. This paper studies MD simulations with LAMMPS on the PARAM Siddhi-AI system. LAMMPS performance is analyzed as a function of the number of NVIDIA A100 GPUs, which are based on the Ampere architecture; the faster and larger combined L1 cache and shared memory in Ampere (192 KB per Streaming Multiprocessor (SM)) deliver additional speedups for High-Performance Computing (HPC) workloads. Single-node multi-GPU as well as multi-node multi-GPU (up to 5 nodes) LAMMPS performance on the PARAM Siddhi-AI system is discussed for two input datasets, LJ 2.5 (intermolecular pair potential) and EAM (interatomic potential). The performance improvement of GPU-enabled LAMMPS over CPU-only runs is demonstrated for both input datasets. The LAMMPS timing breakdown is analyzed across initialization, atom communication, force computation, and thermodynamic output, i.e., the standard categories Pair (non-bonded force computations), Neigh (neighbor list construction), Comm (inter-processor communication of atoms and their properties), Output (output of thermodynamic information and dump files), Modify (fixes and computes invoked by fixes), and Other (all remaining time). GPU utilization (compute and memory) is also discussed for both input datasets. GPU-enabled single-node LAMMPS shows 31x and 125x speedups over single-node CPU-only performance for the LJ 2.5 and EAM input datasets, respectively. LAMMPS scales well across multiple nodes, with almost linear scalability: relative to a single-node GPU-enabled run, the observed speedup on 5 nodes is 4.1x for LJ 2.5 and 3.8x for EAM. A performance comparison across GPU generations shows a 1.5x to 1.9x speedup on A100 GPUs over V100 GPUs.
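The benchmarks named in the abstract correspond to the stock LAMMPS benchmark inputs. For readers who want to reproduce a comparable run, below is a minimal sketch of the LJ 2.5 case, modeled on the bench/in.lj input shipped with LAMMPS; the box size and step count here are the stock defaults, and the exact problem sizes the authors scaled to are not stated in the abstract, so treat them as placeholders.

    # 3d Lennard-Jones melt with 2.5-sigma cutoff (LAMMPS input script)
    units           lj
    atom_style      atomic

    lattice         fcc 0.8442              # reduced density of the stock benchmark
    region          box block 0 20 0 20 0 20
    create_box      1 box
    create_atoms    1 box
    mass            1 1.0

    velocity        all create 1.44 87287 loop geom

    pair_style      lj/cut 2.5              # the "LJ 2.5" pair potential
    pair_coeff      1 1 1.0 1.0 2.5

    neighbor        0.3 bin
    neigh_modify    delay 0 every 20 check no

    fix             1 all nve               # microcanonical time integration
    thermo          100
    run             100

A GPU-enabled run could then be launched with, for example, "mpirun -np 4 lmp -sf gpu -pk gpu 4 -in in.lj" (one MPI rank per GPU, using the LAMMPS GPU package). The abstract does not state which accelerator package (GPU or KOKKOS) the authors used, so this invocation is an assumption rather than the paper's exact configuration.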