Affine Loop Optimization Based on Modulo Unrolling in Chapel
Aroon Sharma, Darren Smith, Joshua Koehler, R. Barua, Michael P. Ferguson
International Conference on Partitioned Global Address Space Programming Models, 2014-10-06
DOI: 10.1145/2676870.2676877
Citations: 8
Abstract
This paper presents modulo unrolling without unrolling (modulo unrolling WU), a method for message aggregation for parallel loops in message passing programs that use affine array accesses in Chapel, a Partitioned Global Address Space (PGAS) parallel programming language. Messages incur a non-trivial runtime overhead, a significant component of which is independent of the size of the message. Therefore, aggregating messages improves performance. Our optimization for message aggregation is based on a technique known as modulo unrolling, pioneered by Barua [3], whose purpose was to ensure a statically predictable single tile number for each memory reference on tiled architectures such as the MIT Raw Machine [18]. Modulo unrolling WU applies to data that is distributed in a cyclic or block-cyclic manner. In this paper, we adapt the aforementioned modulo unrolling technique to the difficult problem of efficiently compiling PGAS languages to message passing architectures. When applied to loops and data distributed cyclically or block-cyclically, modulo unrolling WU can decide when to aggregate messages, thereby reducing the overall message count and runtime for a particular loop. Compared to other methods, modulo unrolling WU greatly simplifies the complex problem of automatically generating message passing code. It also results in a substantial performance improvement compared to the non-optimized Chapel compiler.
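To make the setting concrete, here is a minimal Chapel sketch, not taken from the paper, of the kind of affine-access parallel loop over cyclically distributed data that modulo unrolling WU targets; the array names, bounds, and the shifted access B[i+1] are illustrative assumptions.

```chapel
// Illustrative sketch only -- identifiers and the loop body are assumptions,
// not the paper's benchmarks. Uses Chapel's standard CyclicDist module.
use CyclicDist;

config const n = 1000;

const Space = {0..n-1};
// Deal indices round-robin across locales: index i lives on locale i % numLocales.
const D = Space dmapped Cyclic(startIdx=Space.low);

var A, B: [D] real;

// Affine access B[i+1]: the iteration for index i runs on the locale that owns
// A[i], so each B[i+1] it touches can become a separate small remote read.
forall i in D do
  if i < n-1 then
    A[i] = B[i+1];
```

With a Cyclic distribution and more than one locale, B[i+1] is always owned by a locale other than the one running iteration i, so the unoptimized loop pays the per-message overhead once per element.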
To implement this optimization in Chapel, we modify the leader and follower iterators in the Cyclic and Block Cyclic data distribution modules. We collected results comparing the performance of Chapel programs optimized with modulo unrolling WU against Chapel programs using the existing Chapel data distributions. Data collected on a ten-locale cluster show that, on average, modulo unrolling WU used with Chapel's Cyclic distribution results in 64 percent fewer messages and a 36 percent decrease in runtime for our suite of benchmarks. Similarly, modulo unrolling WU used with Chapel's Block Cyclic distribution results in 72 percent fewer messages and a 53 percent decrease in runtime.
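To illustrate the property that modulo unrolling WU exploits, the following hand-written Chapel sketch is our own analogue, under assumed array names and the same shifted access B[i+1] as above; it is not the code the modified leader and follower iterators actually generate. The point is that, under a cyclic map, each locale's iterations form a strided index set whose remote operands all live on one statically known locale, so they can be moved with a single slice copy standing in for the aggregated message.

```chapel
// Hand-written analogue of the aggregation effect, under our own assumptions;
// this is NOT the paper's modified leader/follower code. Identifiers are illustrative.
use CyclicDist;

config const n = 1000;

const Space = {0..n-1};
const D = Space dmapped Cyclic(startIdx=Space.low);
var A, B: [D] real;

coforall loc in Locales do on loc {
  // Under the cyclic map (startIdx = 0), this locale owns exactly the indices
  // congruent to loc.id mod numLocales.
  const mine   = 0..n-2 by numLocales align loc.id;
  // Every B[i+1] needed for those i lies at an index congruent to loc.id+1,
  // i.e. on the single, statically known locale (loc.id+1) % numLocales.
  const needed = 1..n-1 by numLocales align (loc.id + 1);

  // One strided slice copy stands in for the aggregated message that replaces
  // many single-element remote reads.
  var localB: [needed] real = B[needed];

  // From here on, A[i] and localB[i+1] are both local: no further communication.
  for i in mine do
    A[i] = localB[i+1];
}
```

The compiler-side transformation described in the paper obtains this effect automatically by restructuring the Cyclic and Block Cyclic leader and follower iterators rather than by hand-rewriting each loop.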