Simultaneous branch and warp interweaving for sustained GPU performance

2012 39th Annual International Symposium on Computer Architecture (ISCA) Pub Date : 2012-06-09 DOI:10.1145/2366231.2337166

Nicolas Brunie, Caroline Collange, G. Diamos


Citations: 92

Abstract

Single-Instruction Multiple-Thread (SIMT) micro-architectures implemented in Graphics Processing Units (GPUs) run fine-grained threads in lockstep by grouping them into units, referred to as warps, to amortize the cost of instruction fetch, decode and control logic over multiple execution units. When individual threads take divergent execution paths, their processing takes place sequentially, defeating part of the efficiency advantage of SIMD execution. We present two complementary techniques that mitigate the impact of thread divergence on SIMT micro-architectures. Both techniques relax the SIMD execution model by allowing two distinct instructions, instead of a single one, to be scheduled to disjoint subsets of the same row of execution units. They increase flexibility by providing more thread grouping opportunities than SIMD, while preserving the affinity between threads to avoid introducing extra memory divergence. We consider (1) co-issuing instructions from different divergent paths of the same warp and (2) co-issuing instructions from different warps. To support (1), we introduce a novel thread reconvergence technique that ensures threads resume lockstep execution at control-flow reconvergence points without hindering their ability to run branches in parallel. To allow solution (2) to benefit from inter-warp correlations in divergence patterns, we propose a lane shuffling technique. The combination of all these techniques improves performance by 23% on a set of regular GPGPU applications and by 40% on irregular applications, while maintaining the same instruction-fetch and processing-unit resource requirements as the contemporary Fermi GPU architecture.
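
The co-issue condition described in the abstract can be illustrated with a toy model. The sketch below is not the paper's hardware design: it assumes an 8-lane warp, represents each instruction's active threads as a bitmask, and uses a simple lane rotation as a stand-in for lane shuffling. The names `can_co_issue`, `rotate_lanes` and `issue_cycles` are hypothetical and chosen only for illustration.

```python
WARP_SIZE = 8  # toy width for readability (assumption, not the paper's configuration)


def can_co_issue(mask_a: int, mask_b: int) -> bool:
    """Two instructions can share one row of execution units only if
    no lane is claimed by both, i.e. their active masks are disjoint."""
    return (mask_a & mask_b) == 0


def rotate_lanes(mask: int, shift: int, width: int = WARP_SIZE) -> int:
    """Hypothetical lane shuffle: thread i is mapped to lane (i + shift) % width.
    The rotation is an illustrative assumption, not the mapping used in the paper."""
    shift %= width
    full = (1 << width) - 1
    return ((mask << shift) | (mask >> (width - shift))) & full


def issue_cycles(masks: list[int], co_issue: bool) -> int:
    """Count issue slots for a sequence of active masks, greedily pairing
    neighbouring instructions when co-issue is enabled and masks are disjoint."""
    cycles = 0
    i = 0
    while i < len(masks):
        paired = (
            co_issue
            and i + 1 < len(masks)
            and can_co_issue(masks[i], masks[i + 1])
        )
        i += 2 if paired else 1
        cycles += 1
    return cycles


# (1) Same warp, divergent if/else: the two paths use complementary lanes.
taken, not_taken = 0b00001111, 0b11110000
simd_cycles = issue_cycles([taken, not_taken], co_issue=False)
interleaved_cycles = issue_cycles([taken, not_taken], co_issue=True)

# (2) Two warps with the same divergence pattern: their masks collide
# until the second warp's threads are mapped onto different lanes.
warp0, warp1 = 0b00001111, 0b00001111
conflict_without_shuffle = not can_co_issue(warp0, warp1)
warp1_shuffled = rotate_lanes(warp1, shift=4)
disjoint_after_shuffle = can_co_issue(warp0, warp1_shuffled)
```

In this model the divergent branch costs two issue slots under plain SIMD and one when both paths are co-issued. Case (2) shows why correlated divergence patterns across warps motivate a per-warp lane remapping: identical masks can never be co-issued, while remapped ones can.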

