Multi-tasking Execution in PGAS Language XcalableMP and Communication Optimization on Many-core Clusters

Keisuke Tsugane, Jinpil Lee, H. Murai, M. Sato
{"title":"Multi-tasking Execution in PGAS Language XcalableMP and Communication Optimization on Many-core Clusters","authors":"Keisuke Tsugane, Jinpil Lee, H. Murai, M. Sato","doi":"10.1145/3149457.3154482","DOIUrl":null,"url":null,"abstract":"Large-scale clusters based on many-core processors such as Intel Xeon Phi have recently been deployed. Multi-tasking execution using task dependencies in OpenMP 4.0 is a promising candidate for facilitating the parallelization of such many-core processors, because this enables users to avoid global synchronization through fine-grained task-to-task synchronization using user-specified data dependencies. Recently, the partitioned global address space (PGAS) model has emerged as a usable distributed-memory programming model. In this paper, we propose a multi-tasking execution model in the PGAS language XcalableMP (XMP) for many-core clusters. The model provides a method to describe interactions between tasks based on point-to-point communications on the global address space. A communication is executed non-collectively among nodes. We implemented the proposed execution model in XMP, and designed a simple code transformation algorithm to MPI and OpenMP. We implemented two benchmarks using our model for preliminary evaluation, namely blocked Cholesky factorization and the Laplace equation solver. Most of the implementations using our model outperform the conventional barrier-based data-parallel model. To improve the performance in many-core clusters, we propose a communication optimization method by dedicating a single thread for communications, to avoid performance problems related to the current multi-threaded MPI execution. As a result, the performances of blocked Cholesky factorization and the Laplace equation solver using this communication optimization are improved to 138% and 119% compared with the barrier-based implementation in Intel Xeon Phi KNL clusters, respectively. From the viewpoint of productivity, the program implemented by our model in XMP is almost the same as the implementation based on the OpenMP task depend clause, because XMP enables the parallelization of the serial source code with additional directives and small changes as well as OpenMP.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"24 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3149457.3154482","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 7

Abstract

Large-scale clusters based on many-core processors such as the Intel Xeon Phi have recently been deployed. Multi-tasking execution using task dependencies in OpenMP 4.0 is a promising candidate for facilitating the parallelization of such many-core processors, because it enables users to avoid global synchronization through fine-grained task-to-task synchronization based on user-specified data dependencies. Recently, the partitioned global address space (PGAS) model has emerged as a usable distributed-memory programming model. In this paper, we propose a multi-tasking execution model in the PGAS language XcalableMP (XMP) for many-core clusters. The model provides a method to describe interactions between tasks based on point-to-point communication on the global address space, and communication is executed non-collectively among nodes. We implemented the proposed execution model in XMP and designed a simple algorithm for transforming the code into MPI and OpenMP. For a preliminary evaluation, we implemented two benchmarks using our model: blocked Cholesky factorization and a Laplace equation solver. Most of the implementations using our model outperform the conventional barrier-based data-parallel model. To further improve performance on many-core clusters, we propose a communication optimization that dedicates a single thread to communication, avoiding the performance problems of current multi-threaded MPI execution. With this optimization, the performance of blocked Cholesky factorization and the Laplace equation solver is improved to 138% and 119%, respectively, of the barrier-based implementation on Intel Xeon Phi (KNL) clusters. From the viewpoint of productivity, a program written in our XMP model is almost the same as an implementation based on the OpenMP task depend clause, because XMP, like OpenMP, parallelizes serial source code with additional directives and only small changes.
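
To make the execution model concrete, the following is a minimal sketch, written directly in MPI + OpenMP rather than in XMP, of the kind of code the described transformation targets: a 1-D Laplace (Jacobi) sweep in which the point-to-point halo exchange and the boundary updates are OpenMP tasks linked by depend clauses, so only the tasks that actually read a halo cell wait for its communication, instead of all threads meeting at a global barrier. The grid size NX, the iteration count ITERS, and the 1-D decomposition are illustrative assumptions, not details taken from the paper. Because MPI is called from inside tasks, this variant requires MPI_THREAD_MULTIPLE.

/* Sketch only: task-based halo exchange for a 1-D Jacobi sweep.
 * NX and ITERS are illustrative; compile with e.g. mpicc -fopenmp. */
#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

#define NX    1024            /* local grid points per rank (assumption) */
#define ITERS 100

int main(int argc, char **argv)
{
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* u[0] and u[NX+1] are halo cells filled from the neighbouring ranks. */
    double *u    = calloc(NX + 2, sizeof(double));
    double *unew = calloc(NX + 2, sizeof(double));
    int lo = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int hi = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    #pragma omp parallel
    #pragma omp single
    for (int it = 0; it < ITERS; it++) {
        /* Point-to-point halo exchange expressed as tasks. */
        #pragma omp task depend(out: u[0:1])
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, lo, 0,
                     &u[0], 1, MPI_DOUBLE, lo, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        #pragma omp task depend(out: u[NX+1:1])
        MPI_Sendrecv(&u[NX],     1, MPI_DOUBLE, hi, 0,
                     &u[NX + 1], 1, MPI_DOUBLE, hi, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Interior update: independent of the halo exchange. */
        #pragma omp task
        for (int i = 2; i < NX; i++)
            unew[i] = 0.5 * (u[i - 1] + u[i + 1]);

        /* Boundary updates wait only on the halo cell they read. */
        #pragma omp task depend(in: u[0:1])
        unew[1] = 0.5 * (u[0] + u[2]);

        #pragma omp task depend(in: u[NX+1:1])
        unew[NX] = 0.5 * (u[NX - 1] + u[NX + 1]);

        #pragma omp taskwait
        double *tmp = u; u = unew; unew = tmp;
    }

    free(u); free(unew);
    MPI_Finalize();
    return 0;
}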
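The communication optimization described in the abstract replaces such in-task MPI calls, which require MPI_THREAD_MULTIPLE and can contend inside the MPI library, with a single dedicated communication thread. Below is a minimal sketch of one way such a design could look, assuming a simple request queue: compute threads post send/receive requests and wait on a per-request flag, while one thread, the only one that ever calls MPI (so MPI_THREAD_FUNNELED suffices), drains the queue. The names comm_req_t, MAX_REQ, and post_and_wait are hypothetical and not taken from the paper.

/* Sketch only: a dedicated communication thread.  Assumes at least two
 * OpenMP threads and that the queue never overflows. */
#include <mpi.h>
#include <omp.h>
#include <stdatomic.h>

#define MAX_REQ 64

typedef struct {
    void      *buf;
    int        count, peer, tag, is_send;
    atomic_int done;                  /* set by the comm thread on completion */
} comm_req_t;

static comm_req_t *queue[MAX_REQ];
static int q_head = 0, q_tail = 0;    /* guarded by q_lock */
static atomic_int stop_flag = 0;
static omp_lock_t q_lock;

/* Worker side: enqueue a request and wait for the comm thread to finish it. */
static void post_and_wait(comm_req_t *r)
{
    atomic_store(&r->done, 0);
    omp_set_lock(&q_lock);
    queue[q_tail++ % MAX_REQ] = r;
    omp_unset_lock(&q_lock);
    while (!atomic_load(&r->done))
        ;                             /* a real runtime would taskyield here */
}

/* Communication thread: drain the queue and issue all MPI calls.
 * Sketch: assumes the queue is empty before stop_flag is raised. */
static void comm_thread_loop(void)
{
    while (!atomic_load(&stop_flag)) {
        comm_req_t *r = NULL;
        omp_set_lock(&q_lock);
        if (q_head != q_tail)
            r = queue[q_head++ % MAX_REQ];
        omp_unset_lock(&q_lock);
        if (!r)
            continue;
        if (r->is_send)
            MPI_Send(r->buf, r->count, MPI_DOUBLE, r->peer, r->tag, MPI_COMM_WORLD);
        else
            MPI_Recv(r->buf, r->count, MPI_DOUBLE, r->peer, r->tag, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        atomic_store(&r->done, 1);
    }
}

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    omp_init_lock(&q_lock);

    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0) {
            comm_thread_loop();       /* thread 0 (the MPI-initializing thread) */
        } else {
            /* Compute threads would run tasks here and call post_and_wait()
             * whenever a task needs remote data.  Sketch only: stop at once. */
            atomic_store(&stop_flag, 1);
        }
    }

    omp_destroy_lock(&q_lock);
    MPI_Finalize();
    return 0;
}

Funneling every MPI call through one thread gives up a core but removes thread contention inside the MPI library; this is the trade-off behind the 138% and 119% improvements over the barrier-based implementations reported in the abstract.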