{"title":"Warp-aware trace scheduling for GPUs","authors":"James A. Jablin, T. Jablin, O. Mutlu, M. Herlihy","doi":"10.1145/2628071.2628101","DOIUrl":null,"url":null,"abstract":"GPU performance depends not only on thread/warp level parallelism (TLP) but also on instruction-level parallelism (ILP). It is not enough to schedule instructions within basic blocks, it is also necessary to exploit opportunities for ILP optimization beyond branch boundaries. Unfortunately, modern GPUs cannot dynamically carry out such optimizations because they lack hardware branch prediction and cannot speculatively execute instructions beyond a branch. We propose to circumvent these limitations by adapting Trace Scheduling, a technique originally developed for microcode optimization. Trace Scheduling divides code into traces (or paths), and optimizes each trace in a context-independent way. Adapting Trace Scheduling to GPU code requires revisiting and revising each step of microcode Trace Scheduling to attend to branch and warp behavior, identifying instructions on the critical path, avoiding warp divergence, and reducing divergence time. Here, we propose \"Warp-Aware Trace Scheduling\" for GPUs. As evaluated on the Rodinia Benchmark Suite using dynamic profiling, our fully-automatic optimization achieves a geometric mean speedup of 1.10× on a real system by increasing instructions executed per cycle (IPC) by a harmonic mean of 1.12× and reducing instruction serialization and total instructions executed.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"24","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2628071.2628101","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 24
Abstract
GPU performance depends not only on thread/warp-level parallelism (TLP) but also on instruction-level parallelism (ILP). It is not enough to schedule instructions within basic blocks; it is also necessary to exploit opportunities for ILP optimization beyond branch boundaries. Unfortunately, modern GPUs cannot dynamically carry out such optimizations because they lack hardware branch prediction and cannot speculatively execute instructions beyond a branch. We propose to circumvent these limitations by adapting Trace Scheduling, a technique originally developed for microcode optimization. Trace Scheduling divides code into traces (or paths) and optimizes each trace in a context-independent way. Adapting Trace Scheduling to GPU code requires revisiting and revising each step of microcode Trace Scheduling to attend to branch and warp behavior: identifying instructions on the critical path, avoiding warp divergence, and reducing divergence time. Here, we propose "Warp-Aware Trace Scheduling" for GPUs. As evaluated on the Rodinia Benchmark Suite using dynamic profiling, our fully automatic optimization achieves a geometric mean speedup of 1.10× on a real system by increasing instructions executed per cycle (IPC) by a harmonic mean of 1.12× and by reducing instruction serialization and total instructions executed.
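To make the branch-bounded scheduling opportunity concrete, the minimal CUDA sketch below (not taken from the paper; the kernel name, arrays, and the assumption that the non-negative case is the common path are all hypothetical) shows the kind of data-dependent branch where a trace-scheduling compiler could treat the hot path as straight-line code, hoist its work above the branch to expose ILP, and leave the rare path off the main trace.

```cuda
// Illustrative sketch only: a branchy kernel whose hot path a
// trace-scheduling compiler could optimize across the branch boundary.
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale_or_clamp(const float* in, float* out, float scale, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x = in[i];
    // Assumed common case: x >= 0. Selecting a trace along this path lets
    // the scheduler hoist the multiply above the branch and fill issue
    // slots; the rare negative case would be handled by compensation code
    // off the main trace.
    if (x >= 0.0f)
        out[i] = x * scale;   // hot trace: straight-line, ILP-friendly
    else
        out[i] = 0.0f;        // cold path: the source of warp divergence
}

int main()
{
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (float)(i % 97) - 3.0f;

    scale_or_clamp<<<(n + 255) / 256, 256>>>(in, out, 2.0f, n);
    cudaDeviceSynchronize();

    printf("out[0]=%f out[5]=%f\n", out[0], out[5]);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

In this sketch the divergence cost is tiny, but in real kernels with longer per-path instruction sequences, scheduling across the branch along the profiled hot trace is what exposes the additional ILP the abstract refers to.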