A characterization and analysis of PTX kernels

Andrew Kerr, G. Diamos, S. Yalamanchili
{"title":"A characterization and analysis of PTX kernels","authors":"Andrew Kerr, G. Diamos, S. Yalamanchili","doi":"10.1109/IISWC.2009.5306801","DOIUrl":null,"url":null,"abstract":"General purpose application development for GPUs (GPGPU) has recently gained momentum as a cost-effective approach for accelerating data- and compute-intensive applications. It has been driven by the introduction of C-based programming environments such as NVIDIA's CUDA [1], OpenCL [2], and Intel's Ct [3]. While significant effort has been focused on developing and evaluating applications and software tools, comparatively little has been devoted to the analysis and characterization of applications to assist future work in compiler optimizations, application re-structuring, and micro-architecture design. This paper proposes a set of metrics for GPU workloads and uses these metrics to analyze the behavior of GPU programs. We report on an analysis of over 50 kernels and applications including the full NVIDIA CUDA SDK and UIUC's Parboil Benchmark Suite covering control flow, data flow, parallelism, and memory behavior. The analysis was performed using a full function emulator we developed that implements the NVIDIA virtual machine referred to as PTX (Parallel Thread eXecution architecture) - a machine model and low level virtual ISA that is representative of ISAs for data parallel execution. The emulator can execute compiled kernels from the CUDA compiler, currently supports the full PTX 1.4 specification [4], and has been validated against the full CUDA SDK. The results quantify the importance of optimizations such as those for branch reconvergence, the prevalance of sharing between threads, and highlights opportunities for additional parallelism.","PeriodicalId":387816,"journal":{"name":"2009 IEEE International Symposium on Workload Characterization (IISWC)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"135","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2009 IEEE International Symposium on Workload Characterization (IISWC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IISWC.2009.5306801","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 135

Abstract

General purpose application development for GPUs (GPGPU) has recently gained momentum as a cost-effective approach for accelerating data- and compute-intensive applications. It has been driven by the introduction of C-based programming environments such as NVIDIA's CUDA [1], OpenCL [2], and Intel's Ct [3]. While significant effort has been focused on developing and evaluating applications and software tools, comparatively little has been devoted to the analysis and characterization of applications to assist future work in compiler optimizations, application restructuring, and micro-architecture design. This paper proposes a set of metrics for GPU workloads and uses these metrics to analyze the behavior of GPU programs. We report on an analysis of over 50 kernels and applications, including the full NVIDIA CUDA SDK and UIUC's Parboil Benchmark Suite, covering control flow, data flow, parallelism, and memory behavior. The analysis was performed using a full-function emulator we developed that implements the NVIDIA virtual machine referred to as PTX (Parallel Thread eXecution architecture), a machine model and low-level virtual ISA that is representative of ISAs for data-parallel execution. The emulator can execute compiled kernels from the CUDA compiler, currently supports the full PTX 1.4 specification [4], and has been validated against the full CUDA SDK. The results quantify the importance of optimizations such as those for branch reconvergence, demonstrate the prevalence of sharing between threads, and highlight opportunities for additional parallelism.
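To make the control-flow behavior concrete, the following is a minimal, hypothetical CUDA kernel (not drawn from the CUDA SDK or Parboil; the kernel name and branch condition are illustrative only). Its thread-dependent branch is the kind of construct whose divergence and reconvergence the proposed metrics measure when the compiled PTX is run through the emulator.

```cuda
// Hypothetical sketch, not from the paper: a trivial CUDA kernel whose
// thread-dependent branch causes warp divergence. Compiling it with nvcc
// yields a PTX kernel of the kind the authors' emulator executes.
__global__ void scale_odd_even(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {
        // Threads within a warp take different paths here and reconverge
        // after the if/else; control-flow metrics of the kind proposed in
        // the paper record how many threads are active on each dynamic
        // instruction between the branch and the reconvergence point.
        if (i % 2 == 0)
            data[i] *= 2.0f;
        else
            data[i] *= 0.5f;
    }
}
```

When such a kernel is compiled, the PTX contains a predicated branch and a point where the divergent paths rejoin; an emulator operating at the PTX level can therefore observe, per dynamic instruction, which threads are active, which is what enables the branch-reconvergence and parallelism measurements reported in the paper.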