{"title":"高度可移植的基于c++的模拟器,具有双并行性和空间分解的模拟域,使用浮点运算和每瓦特更多的浮点数,以便更好地解决粒子模拟的问题","authors":"Nisha Agrawal, A. Das, R. Pathak, M. Modani","doi":"10.1109/icccs55155.2022.9846280","DOIUrl":null,"url":null,"abstract":"LAMMPS is a classical molecular dynamics (MD) code that models ensembles of particles in a solid, liquid, or gaseous state. LAMMPS performance has been optimized over the years. This paper focuses on the study of MD simulations using LAMMPS on the PARAM Siddhi-AI system. LAMMPS’s performance is analyzed with the number of the latest NVIDIA A100 co-processor (GPU) based on Ampere architecture. The faster and larger L1 cache and shared memory in Ampere architecture (192 KB per Streaming Multiprocessor (SM)) delivers additional speedups for High-Performance Computing (HPC) workloads. In this work, single-node multi-GPUs as well as multi-node multi-GPUs (up to 5 nodes) LAMMPS performance for two input datasets LJ 2.5 (intermolecular pair potential) and EAM (interatomic potential), on PARAM Siddhi-AI system, is discussed. Performance improvement of GPU-enabled LAMMPS run over CPU-only performance is demonstrated for both the inputs data sets. LAMMPS performance is analyzed for initialization, atoms communication, forces, thermodynamic state (Pair (non-bonded force computations), Neigh (neighbor list construction), Comm (inter-processor communication of atoms and their properties), Output (output of thermodynamic info and dump files), Modify (fixes and computes invoked by fixes) and others (all the remaining time forces and functions)). GPU utilization (in terms of computing and memory) is also discussed for both the input datasets. GPU enabled LAMMPS single-node performance shows 31x and 125x speed-up in comparison to single-node CPU-only performance for LJ2.5 and EAM input datasets respectively. LAMMPS scales well across multi-node and shows almost linear scalability. In comparison to single node GPU enabled LAMMPS run, the observed speedup is 4.1x and 3.8x on 5 nodes, for LJ2.5 and EAM input datasets respectively. LAMMPS performance comparison across GPU generation shows 1.5x to 1.9x speedup on A100 GPUs over V100 GPUs.","PeriodicalId":121713,"journal":{"name":"2022 7th International Conference on Computer and Communication Systems (ICCCS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Highly Portable C++ Based Simulator with Dual Parallelism and Spatial Decomposition of Simulation Domain using Floating Point Operations and More Flops Per Watt for Better Time-To-Solution on Particle Simulation\",\"authors\":\"Nisha Agrawal, A. Das, R. Pathak, M. Modani\",\"doi\":\"10.1109/icccs55155.2022.9846280\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"LAMMPS is a classical molecular dynamics (MD) code that models ensembles of particles in a solid, liquid, or gaseous state. LAMMPS performance has been optimized over the years. This paper focuses on the study of MD simulations using LAMMPS on the PARAM Siddhi-AI system. LAMMPS’s performance is analyzed with the number of the latest NVIDIA A100 co-processor (GPU) based on Ampere architecture. The faster and larger L1 cache and shared memory in Ampere architecture (192 KB per Streaming Multiprocessor (SM)) delivers additional speedups for High-Performance Computing (HPC) workloads. 
In this work, single-node multi-GPUs as well as multi-node multi-GPUs (up to 5 nodes) LAMMPS performance for two input datasets LJ 2.5 (intermolecular pair potential) and EAM (interatomic potential), on PARAM Siddhi-AI system, is discussed. Performance improvement of GPU-enabled LAMMPS run over CPU-only performance is demonstrated for both the inputs data sets. LAMMPS performance is analyzed for initialization, atoms communication, forces, thermodynamic state (Pair (non-bonded force computations), Neigh (neighbor list construction), Comm (inter-processor communication of atoms and their properties), Output (output of thermodynamic info and dump files), Modify (fixes and computes invoked by fixes) and others (all the remaining time forces and functions)). GPU utilization (in terms of computing and memory) is also discussed for both the input datasets. GPU enabled LAMMPS single-node performance shows 31x and 125x speed-up in comparison to single-node CPU-only performance for LJ2.5 and EAM input datasets respectively. LAMMPS scales well across multi-node and shows almost linear scalability. In comparison to single node GPU enabled LAMMPS run, the observed speedup is 4.1x and 3.8x on 5 nodes, for LJ2.5 and EAM input datasets respectively. LAMMPS performance comparison across GPU generation shows 1.5x to 1.9x speedup on A100 GPUs over V100 GPUs.\",\"PeriodicalId\":121713,\"journal\":{\"name\":\"2022 7th International Conference on Computer and Communication Systems (ICCCS)\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-04-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 7th International Conference on Computer and Communication Systems (ICCCS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/icccs55155.2022.9846280\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 7th International Conference on Computer and Communication Systems (ICCCS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/icccs55155.2022.9846280","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Highly Portable C++ Based Simulator with Dual Parallelism and Spatial Decomposition of Simulation Domain using Floating Point Operations and More Flops Per Watt for Better Time-To-Solution on Particle Simulation
LAMMPS is a classical molecular dynamics (MD) code that models ensembles of particles in the solid, liquid, or gaseous state, and its performance has been optimized over the years. This paper studies MD simulations with LAMMPS on the PARAM Siddhi-AI system. LAMMPS performance is analyzed as a function of the number of NVIDIA A100 GPUs, which are based on the Ampere architecture; the faster and larger combined L1 cache and shared memory in Ampere (192 KB per Streaming Multiprocessor (SM)) deliver additional speedups for High-Performance Computing (HPC) workloads. Single-node multi-GPU as well as multi-node multi-GPU (up to 5 nodes) LAMMPS performance on the PARAM Siddhi-AI system is discussed for two input datasets, LJ 2.5 (intermolecular pair potential) and EAM (interatomic potential). The performance improvement of GPU-enabled LAMMPS over CPU-only runs is demonstrated for both input datasets. The LAMMPS timing breakdown is analyzed across initialization, atom communication, force computation, and thermodynamic output, i.e., the standard categories Pair (non-bonded force computations), Neigh (neighbor list construction), Comm (inter-processor communication of atoms and their properties), Output (output of thermodynamic information and dump files), Modify (fixes and computes invoked by fixes), and Other (all remaining time). GPU utilization (compute and memory) is also discussed for both input datasets. GPU-enabled single-node LAMMPS shows 31x and 125x speedups over single-node CPU-only performance for the LJ 2.5 and EAM input datasets, respectively. LAMMPS scales well across multiple nodes, with almost linear scalability: relative to a single-node GPU-enabled run, the observed speedup on 5 nodes is 4.1x for LJ 2.5 and 3.8x for EAM. A performance comparison across GPU generations shows a 1.5x to 1.9x speedup on A100 GPUs over V100 GPUs.
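The benchmarks named in the abstract correspond to the stock LAMMPS benchmark inputs. For readers who want to reproduce a comparable run, below is a minimal sketch of the LJ 2.5 case, modeled on the bench/in.lj input shipped with LAMMPS; the box size and step count here are the stock defaults, and the exact problem sizes the authors scaled to are not stated in the abstract, so treat them as placeholders.

    # 3d Lennard-Jones melt with 2.5-sigma cutoff (LAMMPS input script)
    units           lj
    atom_style      atomic

    lattice         fcc 0.8442              # reduced density of the stock benchmark
    region          box block 0 20 0 20 0 20
    create_box      1 box
    create_atoms    1 box
    mass            1 1.0

    velocity        all create 1.44 87287 loop geom

    pair_style      lj/cut 2.5              # the "LJ 2.5" pair potential
    pair_coeff      1 1 1.0 1.0 2.5

    neighbor        0.3 bin
    neigh_modify    delay 0 every 20 check no

    fix             1 all nve               # microcanonical time integration
    thermo          100
    run             100

A GPU-enabled run could then be launched with, for example, "mpirun -np 4 lmp -sf gpu -pk gpu 4 -in in.lj" (one MPI rank per GPU, using the LAMMPS GPU package). The abstract does not state which accelerator package (GPU or KOKKOS) the authors used, so this invocation is an assumption rather than the paper's exact configuration.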