Communication Optimizations for Distributed-Memory X10 Programs

2011 IEEE International Parallel & Distributed Processing Symposium. Pub Date: 2011-05-16. DOI: 10.1109/IPDPS.2011.105

R. Barik, Jisheng Zhao, D. Grove, Igor Peshansky, Zoran Budimlic, Vivek Sarkar

Citations: 31

Abstract

X10 is a new object-oriented PGAS (Partitioned Global Address Space) programming language with support for distributed asynchronous dynamic parallelism that goes beyond past SPMD message-passing models such as MPI and SPMD PGAS models such as UPC and Co-Array Fortran. The concurrency constructs in X10 make it possible to express complex computation and communication structures with higher productivity than other distributed-memory programming models. However, this productivity often comes at the cost of high performance overhead when the language is used in its full generality. This paper introduces high-level compiler optimizations and transformations to reduce communication and synchronization overheads in distributed-memory implementations of X10 programs. Specifically, we focus on locality optimizations such as scalar replacement and task localization, combined with supporting transformations such as loop distribution, scalar expansion, loop tiling, and loop splitting. We have completed a prototype implementation of these high-level optimizations and performed a performance evaluation that shows significant improvements in performance, scalability, communication volume, and number of tasks. We evaluated the communication optimizations on three platforms: a 128-node Blue Gene/P cluster, a 32-node Nehalem cluster, and a 16-node Power7 cluster. On the Blue Gene/P cluster, we observed a maximum performance improvement of 31.46x relative to the unoptimized case (for the MolDyn benchmark). On the Nehalem cluster, we observed a maximum performance improvement of 3.01x (for the NQueens benchmark), and on the Power7 cluster, we observed a maximum performance improvement of 2.73x (for the MolDyn benchmark). In addition, there was no case in which the optimized code was slower than the unoptimized case. We also believe that the optimizations presented in this paper will be necessary for any high-productivity PGAS language that is based on modern object-oriented principles and designed for execution on future Extreme Scale systems, which place a high premium on locality improvement for performance and energy efficiency.
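To make the idea of scalar replacement concrete, the following is a minimal Python sketch and not the paper's compiler implementation. It assumes a toy model in which running code at a remote place serializes whatever the code captures. The names `make_simulation`, `run_at`, `unoptimized`, and `scalar_replaced` are hypothetical, and JSON serialization stands in for whatever the X10 runtime actually does.

```python
import json


def make_simulation(n=10_000):
    """Hypothetical object that lives at the 'home' place."""
    return {"cutoff": 2.5, "positions": [0.0] * n}


def run_at(log, captured, body):
    """Toy stand-in for running a statement at a remote place:
    serialize the captured data, record the message size in bytes,
    then run the body on the deserialized copy."""
    message = json.dumps(captured).encode("utf-8")
    log.append(len(message))
    return body(json.loads(message))


def unoptimized(sim, log):
    # The body only needs sim["cutoff"], but the whole object is captured
    # and therefore shipped to the remote place.
    return run_at(log, sim, lambda s: s["cutoff"] * 2.0)


def scalar_replaced(sim, log):
    # Scalar replacement: read the field once at the home place and
    # capture only the resulting local scalar.
    cutoff = sim["cutoff"]
    return run_at(log, cutoff, lambda c: c * 2.0)


def compare():
    """Return both results and the bytes sent by each variant."""
    sim = make_simulation()
    before, after = [], []
    r1 = unoptimized(sim, before)
    r2 = scalar_replaced(sim, after)
    return r1, r2, before[0], after[0]


if __name__ == "__main__":
    r1, r2, bytes_before, bytes_after = compare()
    print(f"results: {r1} vs {r2}")
    print(f"bytes sent: {bytes_before} (unoptimized) vs {bytes_after} (scalar-replaced)")
```

Both variants compute the same value, but the second sends only the scalar rather than the enclosing object. This is the kind of reduction in communication volume that the abstract attributes to locality optimizations, here shown under the simplifying assumptions stated above.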
