Loop Parallelization using Dynamic Commutativity Analysis

2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). Pub Date: 2021-02-27. DOI: 10.1109/CGO51591.2021.9370319

Christos Vasiladiotis, Roberto Castañeda Lozano, M. Cole, Björn Franke


Abstract

Outside the numerical domain, automatic parallelization has largely failed to keep its promise of extracting parallelism from sequential legacy code to maximize performance on multi-core systems. In this paper, we develop a novel dynamic commutativity analysis (DCA) for identifying parallelizable loops. By using commutativity instead of dependence tests, DCA avoids many of the overly strict data dependence constraints that limit existing parallelizing compilers. DCA extends the scope of automatic parallelization to uniformly include both regular array-based and irregular pointer-based codes. We have prototyped our parallelism detection analysis and evaluated it extensively against five state-of-the-art dependence-based techniques in two experimental settings. First, when applied to the NAS benchmarks, which contain almost 1400 loops, DCA identifies as many parallel loops (over 1200) as the profile-guided dependence techniques and almost twice as many as all the static techniques combined. Second, we apply DCA to complex pointer-based loops, where it successfully detects parallelism while existing techniques fail to identify any. When combined with existing parallel code generation techniques, this results in an average speedup of 3.6x (and up to 55x) across the NAS benchmarks on a 72-core host, and up to 36.9x for the pointer-based loops, demonstrating the effectiveness of DCA in identifying profitable parallelism across a wide range of loops.
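To illustrate the contrast between commutativity and dependence tests, the following is a minimal Python sketch of the general idea only; it is not the authors' implementation, and the abstract does not describe how DCA is actually realized. The helper names (`run_loop`, `looks_commutative`) and the two example loop bodies are hypothetical. The sketch assumes that a loop can be treated as a candidate for parallelization if executing its iterations in randomly permuted orders yields the same final state as the sequential order.

```python
import copy
import random


def run_loop(body, state, order):
    """Apply body(state, i) for each iteration index in the given order."""
    state = copy.deepcopy(state)  # never mutate the caller's initial state
    for i in order:
        body(state, i)
    return state


def looks_commutative(body, state, n_iters, trials=20, seed=0):
    """Return True if every sampled permutation matches the sequential result.

    Hypothetical helper: a dynamic test on one input can only suggest
    commutativity; it does not prove it for all inputs.
    """
    rng = random.Random(seed)
    reference = run_loop(body, state, range(n_iters))
    for _ in range(trials):
        order = list(range(n_iters))
        rng.shuffle(order)
        if run_loop(body, state, order) != reference:
            return False
    return True


def histogram_body(state, i):
    # Iterations update shared counters, so a dependence test sees a
    # cross-iteration dependence, yet the final counts are order-insensitive.
    key = state["data"][i]
    state["hist"][key] = state["hist"].get(key, 0) + 1


def prefix_sum_body(state, i):
    # Each iteration reads the value its predecessor wrote: order matters.
    prev = state["out"][i - 1] if i > 0 else 0
    state["out"][i] = prev + state["data"][i]


data = [3, 1, 3, 2, 1, 3]
hist_ok = looks_commutative(histogram_body, {"data": data, "hist": {}}, len(data))
scan_ok = looks_commutative(
    prefix_sum_body, {"data": data, "out": [0] * len(data)}, len(data)
)
print(hist_ok, scan_ok)
```

In this sketch the histogram loop passes the check even though its iterations touch the same counters, while the prefix-sum loop fails it. The example uses integer arithmetic so that results compare exactly; a floating-point reduction would likely need a tolerance-based comparison, which is an additional assumption not covered here.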
