A compiler-based approach for GPGPU performance calibration using TLP modulation (WIP paper)

Yongseung Yu, Seokwon Kang, Yongjun Park
{"title":"基于编译器的TLP调制GPGPU性能标定方法(WIP论文)","authors":"Yongseung Yu, Seokwon Kang, Yongjun Park","doi":"10.1145/3316482.3326343","DOIUrl":null,"url":null,"abstract":"Modern GPUs are the most successful accelerators as they provide outstanding performance gain by using CUDA or OpenCL programming models. For maximum performance, programmers typically try to maximize the number of thread blocks of target programs, and GPUs also generally attempt to allocate the maximum number of thread blocks to their GPU cores. However, many recent studies have pointed out that simply allocating the maximum number of thread blocks to GPU cores does not always guarantee the best performance, and identifying proper number of thread blocks per GPU core is a major challenge. Despite these studies, most existing architectural techniques cannot be directly applied to current GPU hardware, and the optimal number of thread blocks can vary significantly depending on the target GPU and application characteristics. To solve these problems, this study proposes a just-in-time thread block number adjustment system using CUDA binary modification upon an LLVM compiler framework, referred to as the CTA-Limiter, in order to dynamically maximize GPU performance on real GPUs without reprogramming. The framework gradually reduces the number of concurrent thread blocks of target CUDA workloads using extra shared memory allocation, and compares the execution time with the previous version to automatically identify the optimal number of co-running thread blocks per GPU Core. The results showed meaningful performance improvements, averaging at 30%, 40%, and 44%, in GTX 960, GTX 1050, and GTX 1080 Ti, respectively.","PeriodicalId":256029,"journal":{"name":"Proceedings of the 20th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems","volume":"11 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"A compiler-based approach for GPGPU performance calibration using TLP modulation (WIP paper)\",\"authors\":\"Yongseung Yu, Seokwon Kang, Yongjun Park\",\"doi\":\"10.1145/3316482.3326343\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Modern GPUs are the most successful accelerators as they provide outstanding performance gain by using CUDA or OpenCL programming models. For maximum performance, programmers typically try to maximize the number of thread blocks of target programs, and GPUs also generally attempt to allocate the maximum number of thread blocks to their GPU cores. However, many recent studies have pointed out that simply allocating the maximum number of thread blocks to GPU cores does not always guarantee the best performance, and identifying proper number of thread blocks per GPU core is a major challenge. Despite these studies, most existing architectural techniques cannot be directly applied to current GPU hardware, and the optimal number of thread blocks can vary significantly depending on the target GPU and application characteristics. To solve these problems, this study proposes a just-in-time thread block number adjustment system using CUDA binary modification upon an LLVM compiler framework, referred to as the CTA-Limiter, in order to dynamically maximize GPU performance on real GPUs without reprogramming. 
The framework gradually reduces the number of concurrent thread blocks of target CUDA workloads using extra shared memory allocation, and compares the execution time with the previous version to automatically identify the optimal number of co-running thread blocks per GPU Core. The results showed meaningful performance improvements, averaging at 30%, 40%, and 44%, in GTX 960, GTX 1050, and GTX 1080 Ti, respectively.\",\"PeriodicalId\":256029,\"journal\":{\"name\":\"Proceedings of the 20th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems\",\"volume\":\"11 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-06-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 20th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3316482.3326343\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 20th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3316482.3326343","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1

Abstract

Modern GPUs are the most successful accelerators as they provide outstanding performance gains by using CUDA or OpenCL programming models. For maximum performance, programmers typically try to maximize the number of thread blocks of target programs, and GPUs also generally attempt to allocate the maximum number of thread blocks to their GPU cores. However, many recent studies have pointed out that simply allocating the maximum number of thread blocks to GPU cores does not always guarantee the best performance, and identifying the proper number of thread blocks per GPU core is a major challenge. Despite these studies, most existing architectural techniques cannot be directly applied to current GPU hardware, and the optimal number of thread blocks can vary significantly depending on the target GPU and application characteristics. To solve these problems, this study proposes a just-in-time thread block number adjustment system using CUDA binary modification upon an LLVM compiler framework, referred to as the CTA-Limiter, in order to dynamically maximize GPU performance on real GPUs without reprogramming. The framework gradually reduces the number of concurrent thread blocks of target CUDA workloads using extra shared memory allocation, and compares the execution time with the previous version to automatically identify the optimal number of co-running thread blocks per GPU core. The results showed meaningful performance improvements, averaging 30%, 40%, and 44% on the GTX 960, GTX 1050, and GTX 1080 Ti, respectively.
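
The mechanism the abstract describes can be illustrated in isolation: each thread block's shared-memory footprint is padded at launch time so that fewer blocks fit on one SM, and each candidate limit is timed to find the fastest configuration. The sketch below is a minimal illustration of that general occupancy-limiting trick, not the authors' CTA-Limiter (which modifies CUDA binaries through an LLVM-based flow without reprogramming); the vecAdd kernel, the candidate limits, and the paddingForBlockLimit helper are assumptions introduced only for this example.

```cuda
// Minimal sketch of TLP modulation via shared-memory padding. Not the authors'
// CTA-Limiter; the kernel, limits, and helper below are illustrative assumptions.
#include <algorithm>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    // The padding is never touched; it only shrinks how many blocks fit per SM.
    extern __shared__ char occupancy_pad[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Dynamic shared-memory bytes per block so that at most `blocksPerSM` blocks of
// this kernel can be co-resident on one SM (clamped to the per-block limit, so
// very low limits may not be reachable with padding alone).
size_t paddingForBlockLimit(int device, int blocksPerSM) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    size_t pad = prop.sharedMemPerMultiprocessor / blocksPerSM;
    return std::min(pad, prop.sharedMemPerBlock);
}

int main() {
    const int n = 1 << 24;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *a, *b, *c;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Step down through candidate per-SM block limits and time each launch,
    // mirroring in spirit the search the abstract describes.
    for (int limit = 8; limit >= 1; --limit) {
        size_t pad = paddingForBlockLimit(0, limit);
        cudaEventRecord(start);
        vecAdd<<<blocks, threads, pad>>>(a, b, c, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("<= %d blocks/SM (pad %zu B): %.3f ms\n", limit, pad, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Note that the per-block dynamic shared-memory cap (48 KiB by default on the GPUs listed above) bounds how far occupancy can be pushed down with padding alone, which is why the helper clamps the padding to sharedMemPerBlock.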