A Case for a Flexible Scalar Unit in SIMT Architecture

2014 IEEE 28th International Parallel and Distributed Processing Symposium. Pub Date: 2014-05-19. DOI: 10.1109/IPDPS.2014.21

Yi Yang, Ping Xiang, Mike Mantor, Norman Rubin, Lisa R. Hsu, Qunfeng Dong, Huiyang Zhou


Citations: 19

Abstract

The wide availability of graphics processing units (GPUs) and their Single-Instruction Multiple-Thread (SIMT) programming model have made them a promising choice for high-performance computing. However, under SIMT-style processing, an instruction is executed in every thread even if its operands are identical across all threads. To overcome this inefficiency, AMD's latest Graphics Core Next (GCN) architecture integrates a scalar unit into a SIMT unit. In GCN, the SIMT unit and the scalar unit share a single SIMT-style instruction stream; depending on its type, an instruction is issued to either the scalar unit or the SIMT unit. In this paper, we propose to extend the scalar unit so that it can either share the instruction stream with the SIMT unit or execute a separate instruction stream. The program executed by the scalar unit is referred to as a scalar program, and its purpose is to assist SIMT-unit execution. Scalar programs are either generated automatically from SIMT programs by the compiler or developed manually by expert developers. We make a case for the proposed flexible scalar unit through three collaborative execution paradigms: data prefetching, control divergence elimination, and scalar-workload extraction. Our experimental results show that the proposed approaches achieve significant performance gains over state-of-the-art SIMT-style processing.
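The inefficiency the abstract describes, where an instruction runs in every thread even when its operands are identical, can be illustrated with a small counting model. This is a minimal sketch, not the paper's method and not GCN's actual issue logic: the names `is_uniform` and `count_lane_ops`, the two-instruction `program`, and the cost model (one operation per lane, or one operation in total on the scalar unit) are all assumptions made for illustration.

```python
WAVEFRONT_SIZE = 64  # assumption for this sketch: 64 lanes per wavefront


def is_uniform(operand):
    """Return True when every lane holds the same value for this operand."""
    return all(value == operand[0] for value in operand)


def count_lane_ops(instructions, use_scalar_unit):
    """Count lane-level operations for a list of instructions.

    Each instruction is a list of operands, and each operand is a list
    holding one value per lane (hypothetical representation).
    """
    total = 0
    for operands in instructions:
        if use_scalar_unit and all(is_uniform(op) for op in operands):
            total += 1               # executed once on the scalar unit
        else:
            total += WAVEFRONT_SIZE  # executed in every lane of the SIMT unit
    return total


lanes = list(range(WAVEFRONT_SIZE))
program = [
    # Both operands identical across lanes (e.g. a base plus a stride).
    [[4] * WAVEFRONT_SIZE, [256] * WAVEFRONT_SIZE],
    # First operand differs per lane (e.g. a thread index times a constant).
    [lanes, [4] * WAVEFRONT_SIZE],
]

simt_only = count_lane_ops(program, use_scalar_unit=False)
with_scalar = count_lane_ops(program, use_scalar_unit=True)
```

In this toy model the uniform instruction costs one operation instead of one per lane when a scalar unit is available, while the per-lane instruction is unaffected. Real hardware costs depend on issue rules and latencies that the sketch does not model.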

