A scalable multi-path microarchitecture for efficient GPU control flow

2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)
Pub Date: 2014-06-19
DOI: 10.1109/HPCA.2014.6835936

Ahmed Eltantawy, Jessica Wenjie Ma, Mike O'Connor, Tor M. Aamodt


Citations: 33


A scalable multi-path microarchitecture for efficient GPU control flow

Graphics processing units (GPUs) are increasingly used for non-graphics computing. However, applications with divergent control flow incur performance degradation on current GPUs. These GPUs implement the SIMT execution model by serializing the execution of the different control flow paths encountered by a warp. This serialization can mask thread-level parallelism among the scalar threads comprising a warp, thus degrading performance. In this paper, we propose a novel branch divergence handling mechanism that enables interleaved execution of divergent paths within a warp while maintaining immediate postdominator reconvergence. This multi-path microarchitecture decouples divergence and reconvergence tracking by replacing the stack-based structure typically employed to support SIMT execution with two tables: a warp split table and a warp reconvergence table. It also enables reconvergence before the immediate postdominator, which is important for efficient execution of unstructured control flow. Evaluated on a set of benchmarks with complex divergent control flow, our proposal achieves up to a 7× speedup, with a harmonic mean of 32%, over conventional single-path SIMT execution.
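The contrast the abstract draws between stack-based serialization and two-table interleaving can be illustrated with a toy model. The sketch below is only an illustration under stated assumptions, not the paper's hardware design: the four-block if/else diamond, the 4-thread warp, the round-robin issue policy, and every name in it (BLOCK_LEN, single_path, multi_path, splits, pending) are hypothetical. It records only the order in which basic blocks issue; it does not model memory latency, which is where interleaving would actually pay off, nor reconvergence before the immediate postdominator.

```python
from collections import deque

# Toy if/else diamond (hypothetical): block A ends in a branch, B and C are
# the two sides, and D is the immediate postdominator (reconvergence point).
BLOCK_LEN = {"A": 1, "B": 2, "C": 2, "D": 1}   # instructions per block
RECONV = "D"
FULL_MASK = 0b1111                              # 4 threads per toy warp


def single_path(taken_mask):
    """Stack-based baseline: divergent paths issue one after the other."""
    trace = ["A"] * BLOCK_LEN["A"]
    not_taken = FULL_MASK & ~taken_mask
    # Stack entries are (block, active_mask); the top of stack is the last item.
    stack = [(RECONV, FULL_MASK)]
    if not_taken:
        stack.append(("C", not_taken))
    if taken_mask:
        stack.append(("B", taken_mask))
    while stack:
        block, _mask = stack.pop()              # only the top entry may issue
        trace += [block] * BLOCK_LEN[block]
    return trace


def multi_path(taken_mask):
    """Two-table sketch: every split is schedulable; issue is round-robin."""
    trace = ["A"] * BLOCK_LEN["A"]
    not_taken = FULL_MASK & ~taken_mask
    # Split table (toy): one schedulable entry per divergent path.
    splits = deque()
    if taken_mask:
        splits.append({"block": "B", "left": BLOCK_LEN["B"], "mask": taken_mask})
    if not_taken:
        splits.append({"block": "C", "left": BLOCK_LEN["C"], "mask": not_taken})
    # Reconvergence table (toy): threads still expected at the reconvergence point.
    pending = FULL_MASK
    while splits:
        split = splits.popleft()
        trace.append(split["block"])            # issue one instruction
        split["left"] -= 1
        if split["left"] > 0:
            splits.append(split)                # rotate so another split issues next
            continue
        if split["block"] == RECONV:
            continue                            # the merged warp has finished D
        pending &= ~split["mask"]               # this path has arrived at D
        if pending == 0:                        # all paths arrived: merge them
            splits.append({"block": RECONV, "left": BLOCK_LEN[RECONV],
                           "mask": FULL_MASK})
    return trace


print(single_path(0b0011))
print(multi_path(0b0011))
```

With two threads taking each side of the branch, the baseline issues all of B before any of C, while the two-table sketch alternates between them; both traces contain the same number of instructions, so any benefit in a real pipeline would have to come from overlapping stalls on one path with issue from the other.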

