Divergence Analysis and Optimizations

2011 International Conference on Parallel Architectures and Compilation Techniques Pub Date : 2011-10-10 DOI:10.1109/PACT.2011.63

Bruno Coutinho, Diogo Sampaio, Fernando Magno Quintão Pereira, Wagner Meira Jr

{"title":"Divergence Analysis and Optimizations","authors":"Bruno Coutinho, Diogo Sampaio, Fernando Magno Quintão Pereira, Wagner Meira Jr","doi":"10.1109/PACT.2011.63","DOIUrl":null,"url":null,"abstract":"The growing interest in GPU programming has brought renewed attention to the Single Instruction Multiple Data (SIMD) execution model. SIMD machines give application developers a tremendous computational power, however, the model also brings restrictions. In particular, processing elements (PEs) execute in lock-step, and may lose performance due to divergences caused by conditional branches. In face of divergences, some PEs execute, while others wait, this alternation ending when they reach a synchronization point. In this paper we introduce divergence analysis, a static analysis that determines which program variables will have the same values for every PE. This analysis is useful in three different ways: it improves the translation of SIMD code to non-SIMD CPUs, it helps developers to manually improve their SIMD applications, and it also guides the compiler in the optimization of SIMD programs. We demonstrate this last point by introducing branch fusion, a new compiler optimization that identifies, via a gene sequencing algorithm, chains of similarities between divergent program paths, and weaves these paths together as much as possible. Our implementation has been accepted in the Ocelot open-source CUDA compiler, and is publicly available. We have tested it on many industrial-strength GPU benchmarks, including Rodinia and the Nvidia's SDK. Our divergence analysis has a 34% false-positive rate, compared to the results of a dynamic profiler. Our automatic optimization adds a 3% speed-up onto parallel quick sort, a heavily optimized benchmark. Our manual optimizations extend this number to over 10%.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"43 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"104","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 International Conference on Parallel Architectures and Compilation Techniques","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PACT.2011.63","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 104

Abstract

The growing interest in GPU programming has brought renewed attention to the Single Instruction Multiple Data (SIMD) execution model. SIMD machines give application developers a tremendous computational power, however, the model also brings restrictions. In particular, processing elements (PEs) execute in lock-step, and may lose performance due to divergences caused by conditional branches. In face of divergences, some PEs execute, while others wait, this alternation ending when they reach a synchronization point. In this paper we introduce divergence analysis, a static analysis that determines which program variables will have the same values for every PE. This analysis is useful in three different ways: it improves the translation of SIMD code to non-SIMD CPUs, it helps developers to manually improve their SIMD applications, and it also guides the compiler in the optimization of SIMD programs. We demonstrate this last point by introducing branch fusion, a new compiler optimization that identifies, via a gene sequencing algorithm, chains of similarities between divergent program paths, and weaves these paths together as much as possible. Our implementation has been accepted in the Ocelot open-source CUDA compiler, and is publicly available. We have tested it on many industrial-strength GPU benchmarks, including Rodinia and the Nvidia's SDK. Our divergence analysis has a 34% false-positive rate, compared to the results of a dynamic profiler. Our automatic optimization adds a 3% speed-up onto parallel quick sort, a heavily optimized benchmark. Our manual optimizations extend this number to over 10%.

查看原文本刊更多论文

发散分析和优化

对GPU编程日益增长的兴趣使人们重新关注单指令多数据(SIMD)执行模型。SIMD机器为应用程序开发人员提供了巨大的计算能力，然而，该模型也带来了限制。特别是，处理元素(pe)以锁步执行，并且可能由于条件分支引起的分歧而失去性能。面对分歧，一些pe执行，而另一些pe等待，这种交替在它们到达同步点时结束。在本文中，我们介绍了散度分析，这是一种静态分析，用于确定哪些程序变量对每个PE具有相同的值。这种分析在三个不同的方面是有用的:它改进了SIMD代码到非SIMD cpu的转换，它帮助开发人员手动改进他们的SIMD应用程序，它还指导编译器优化SIMD程序。我们通过引入分支融合来证明最后一点，分支融合是一种新的编译器优化，它通过基因测序算法识别不同程序路径之间的相似性链，并尽可能地将这些路径编织在一起。我们的实现已经被Ocelot开源CUDA编译器所接受，并且是公开可用的。我们已经在许多工业级GPU基准测试中测试了它，包括Rodinia和Nvidia的SDK。与动态分析器的结果相比，我们的散度分析有34%的假阳性率。我们的自动优化为并行快速排序增加了3%的加速，这是一个经过大量优化的基准。我们的手动优化将这个数字扩展到10%以上。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

2011 International Conference on Parallel Architectures and Compilation Techniques

自引率

0.00%

发文量