Unstructured Control Flow in GPGPU

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum. Publication date: 2013-05-20. DOI: 10.1109/IPDPSW.2013.247

Rodrigo Dominguez, D. Kaeli


Citation count: 5

Abstract



The current trend toward heterogeneous architectures motivates us to reconsider current software and hardware paradigms. The focus is on new parallel programming models, compiler design, and runtime resource management techniques that exploit the features of many-core processor architectures. Graphics Processing Units (GPUs) have become the platform of choice in this area for accelerating a wide range of data-parallel and task-parallel applications. The rapid adoption of GPU computing has been greatly aided by the introduction of high-level programming environments such as CUDA C and OpenCL. However, each vendor implements these programming models differently, and we must analyze the internals to better understand the performance results. One of the main differences across implementations is how the compiler and the hardware handle program control flow. Some implementations support unstructured control flow based on branches and labels; others are based on structured control flow that relies solely on if-then and while constructs. In this paper we describe a tool that can be used to analyze the difference between these two approaches. We created a dynamic compiler called Caracal that translates applications with unstructured control flow so they can run on hardware that requires structured programs. To accomplish this, Caracal builds a control tree of the program and creates single-entry, single-exit regions called hammock graphs. We used this tool to analyze the performance differences between NVIDIA's implementation of CUDA C and AMD's implementation of OpenCL. We found that the requirement for structured control flow can increase the number of allocated registers by 20 and affect performance by as much as 2x.
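The distinction between branch-and-label control flow and if-then/while control flow can be illustrated with a small sketch. The Python below is not Caracal's algorithm and is not taken from the paper; it is a generic illustration, assuming a short-circuit condition `a || b` as the unstructured pattern. The function names and the `take_then` variable are hypothetical. The first function emulates labels with a dispatch loop, where the "then" block is reachable from two different branches. The second expresses the same behavior with nested single-entry, single-exit regions by carrying the branch decision in an extra variable.

```python
from itertools import product


def run_unstructured(a, b):
    """Interpret a tiny control-flow graph with explicit branch targets.

    Models `if (a || b) x = 1; else x = 2;` as branches and labels:
    the "then" block has two predecessors ("entry" and "test_b").
    The dispatch loop only emulates labels; Python has no goto.
    """
    label = "entry"
    x = None
    while label != "exit":
        if label == "entry":
            label = "then" if a else "test_b"   # branch on a
        elif label == "test_b":
            label = "then" if b else "else"     # branch on b
        elif label == "then":
            x = 1
            label = "exit"
        else:  # label == "else"
            x = 2
            label = "exit"
    return x


def run_structured(a, b):
    """Same behavior using only if-then-else regions.

    The extra variable `take_then` (hypothetical name) records the
    branch decision so every region has one entry and one exit.
    """
    take_then = bool(a)
    if not take_then:
        take_then = bool(b)
    if take_then:
        x = 1
    else:
        x = 2
    return x


if __name__ == "__main__":
    # Both forms agree on every input combination.
    for a, b in product([False, True], repeat=2):
        assert run_unstructured(a, b) == run_structured(a, b)
```

The structured version needs one more live value than the branch-based version. This is only one plausible way that a structuring pass could add register pressure; the abstract reports the register increase but does not attribute it to a specific mechanism.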
