Multi-Threading and Lock-Free MPI RMA Based Graph Processing on KNL and POWER Architectures

Proceedings of the 25th European MPI Users' Group Meeting
Pub Date: 2018-09-23
DOI: 10.1145/3236367.3236371

Mingzhe Li, Xiaoyi Lu, H. Subramoni, D. Panda


Citations: 1

Abstract

Intel Knights Landing (KNL) and IBM POWER architectures are becoming widely deployed on modern supercomputing systems because of their powerful components. The MPI Remote Memory Access (RMA) model, which provides one-sided communication semantics, is seen as an attractive approach for developing High-Performance Data Analytics (HPDA) applications, such as graph processing, that have irregular communication characteristics. To take advantage of the large number of hardware threads offered by KNL and POWER, HPDA applications and the MPI RMA runtime need to be redesigned for optimal performance. In this paper, we propose multi-threading and lock-free designs in the MPI runtime as well as in the Graph500 application on KNL and POWER architectures. At the micro-benchmark level, our proposed runtime-level designs reduce the latency of uni-directional MPI_Put and MPI_Get by up to 3X compared to IntelMPI and Spectrum MPI. At the application level, with 1,024 processes on 32 KNL nodes, our proposed design outperforms the IntelMPI library by 32%. With 512 processes on eight POWER nodes, our proposed design outperforms the Spectrum MPI library by 19%. To the best of our knowledge, this is the first paper to design and evaluate MPI RMA-based graph processing applications on KNL and POWER architectures.
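For readers unfamiliar with Graph500, its core kernel is a breadth-first search (BFS) that produces a parent array. The following is a minimal single-process sketch of a level-synchronous BFS, written only to illustrate why the communication pattern is irregular. It is not the authors' multi-threaded, lock-free MPI RMA design; the function names `build_adjacency` and `bfs_parents` and the tiny edge list are assumptions made for illustration.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency list from (u, v) pairs."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj


def bfs_parents(adj, root, num_vertices):
    """Level-synchronous BFS returning a parent array (-1 means unreached)."""
    parent = [-1] * num_vertices
    parent[root] = root  # the root is recorded as its own parent
    frontier = [root]
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                # The index v depends on the graph data, so accesses to
                # parent[] are scattered. If the array were partitioned
                # across processes, parent[v] could live in remote memory,
                # which is the kind of access one-sided operations such as
                # MPI_Put and MPI_Get are meant for.
                if parent[v] == -1:
                    parent[v] = u
                    next_frontier.append(v)
        frontier = next_frontier
    return parent


# Hypothetical 6-vertex graph; vertex 5 has no edges and stays unreached.
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
adj = build_adjacency(edges)
parents = bfs_parents(adj, root=0, num_vertices=6)
print(parents)
```

In this sequential sketch the check-then-write on `parent[v]` needs no protection. With many threads or processes updating the same array concurrently, that step is where synchronization cost arises, which is the kind of overhead the lock-free designs described in the abstract aim to reduce.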

