Running a Single Instruction Execution Stream to a Massively Parallelized Computational Operations

2021 IEEE 2nd International Conference on Technology, Engineering, Management for Societal impact using Marketing, Entrepreneurship and Talent (TEMSMET). Pub Date: 2021-12-02. DOI: 10.1109/temsmet53515.2021.9768703

Nisha Agrawal, Abhishek Das, R. Pathak, M. Modani


Citations: 0

Abstract

GROMACS is used extensively for simulating biochemical molecules, and its performance has been optimized over the years on a range of homogeneous and heterogeneous computing architectures. This paper studies the behavior of Molecular Dynamics (MD) simulations using GROMACS on the PARAM Siddhi-AI system. Application performance is analyzed on CPUs (AMD EPYC) and GPUs (NVIDIA A100). For CPU-only runs, single-node performance is slightly better with OpenMPI than with thread-MPI. A combination of 16 MPI ranks with 8 OpenMP threads gives better single-node performance. The performance of multi-node CPU-only GROMACS runs increases by a factor of 1.1x as the number of nodes increases. For single-node GROMACS-GPU runs, all the forces (bonded, non-bonded, and PME) are offloaded to the GPUs. For multi-node GROMACS-GPU runs, however, only the bonded and non-bonded forces are offloaded. Single-node GROMACS-GPU runs perform ~18x better than single-node CPU-only runs, and ~3x better than accelerated GROMACS execution on 5 nodes.
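The abstract does not give the launch commands used, so the following is only a minimal sketch of how the configurations it describes could be expressed. It assumes a 128-core node (inferred from 16 ranks x 8 threads, not stated in the abstract), an MPI build named `gmx_mpi`, a thread-MPI build named `gmx`, and a hypothetical run name `md`. The helper names `decompositions` and `mdrun_command` are illustrative; only the `mdrun` options `-ntmpi`, `-ntomp`, `-deffnm`, `-nb`, `-pme`, and `-bonded` are taken from GROMACS itself.

```python
import shlex


def decompositions(cores):
    """Return every (ranks, threads) pair whose product equals `cores`."""
    return [(r, cores // r) for r in range(1, cores + 1) if cores % r == 0]


def mdrun_command(ranks, threads, launcher="openmpi", gpu_forces=(), deffnm="md"):
    """Build an illustrative mdrun command line (assumed, not from the paper).

    launcher   -- "openmpi" (mpirun + gmx_mpi) or "tmpi" (built-in thread-MPI)
    gpu_forces -- subset of {"nb", "pme", "bonded"} to offload to the GPU
    """
    if launcher == "openmpi":
        cmd = ["mpirun", "-np", str(ranks), "gmx_mpi", "mdrun"]
    elif launcher == "tmpi":
        cmd = ["gmx", "mdrun", "-ntmpi", str(ranks)]
    else:
        raise ValueError(f"unknown launcher: {launcher}")
    cmd += ["-ntomp", str(threads), "-deffnm", deffnm]
    # Place each force class explicitly on the CPU or the GPU.
    for force in ("nb", "pme", "bonded"):
        cmd += [f"-{force}", "gpu" if force in gpu_forces else "cpu"]
    return shlex.join(cmd)


if __name__ == "__main__":
    # Candidate rank/thread splits for the assumed 128-core node.
    for ranks, threads in decompositions(128):
        print(f"{ranks:3d} ranks x {threads:3d} threads")

    # CPU-only, 16 ranks x 8 threads, with OpenMPI and with thread-MPI.
    print(mdrun_command(16, 8, launcher="openmpi"))
    print(mdrun_command(16, 8, launcher="tmpi"))

    # Single-node GPU case: bonded, non-bonded, and PME all offloaded.
    # The rank and thread counts here are placeholders, not the paper's values.
    print(mdrun_command(1, 8, launcher="tmpi", gpu_forces=("nb", "pme", "bonded")))

    # Multi-node GPU case: only bonded and non-bonded offloaded, PME on the CPU.
    print(mdrun_command(16, 8, launcher="openmpi", gpu_forces=("nb", "bonded")))
```

Real runs may need further options (for example a dedicated PME rank or GPU-to-rank mapping) that depend on the GROMACS version and the system, so the printed commands are starting points to adapt, not the authors' configuration.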
