Ivan R. Ivanov, O. Zinenko, Jens Domke, Toshio Endo, William S. Moses
{"title":"重新定位和重新专用 GPU 工作负载以实现性能可移植性","authors":"Ivan R. Ivanov, O. Zinenko, Jens Domke, Toshio Endo, William S. Moses","doi":"10.1109/CGO57630.2024.10444828","DOIUrl":null,"url":null,"abstract":"In order to come close to peak performance, accelerators like GPUs require significant architecture-specific tuning that understand the availability of shared memory, parallelism, tensor cores, etc. Unfortunately, the pursuit of higher performance and lower costs have led to a significant diversification of architecture designs, even from the same vendor. This creates the need for performance portability across different GPUs, especially important for programs in a particular programming model with a certain architecture in mind. Even when the program can be seamlessly executed on a different architecture, it may suffer a performance penalty due to it not being sized appropriately to the available hardware resources such as fast memory and registers, let alone not using newer advanced features of the architecture. We propose a new approach to improving performance of (legacy) CUDA programs for modern machines by automatically adjusting the amount of work each parallel thread does, and the amount of memory and register resources it requires. By operating within the MLIR compiler infrastructure, we are able to also target AMD GPUs by performing automatic translation from CUDA and simultaneously adjust the program granularity to fit the size of target GPUs. Combined with autotuning assisted by the platform-specific compiler, our approach demonstrates 27% geomean speedup on the Rodinia benchmark suite over baseline CUDA implementation as well as performance parity between similar NVIDIA and AMD GPUs executing the same CUDA program.","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"53 7","pages":"119-132"},"PeriodicalIF":0.0000,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Retargeting and Respecializing GPU Workloads for Performance Portability\",\"authors\":\"Ivan R. Ivanov, O. Zinenko, Jens Domke, Toshio Endo, William S. Moses\",\"doi\":\"10.1109/CGO57630.2024.10444828\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In order to come close to peak performance, accelerators like GPUs require significant architecture-specific tuning that understand the availability of shared memory, parallelism, tensor cores, etc. Unfortunately, the pursuit of higher performance and lower costs have led to a significant diversification of architecture designs, even from the same vendor. This creates the need for performance portability across different GPUs, especially important for programs in a particular programming model with a certain architecture in mind. Even when the program can be seamlessly executed on a different architecture, it may suffer a performance penalty due to it not being sized appropriately to the available hardware resources such as fast memory and registers, let alone not using newer advanced features of the architecture. We propose a new approach to improving performance of (legacy) CUDA programs for modern machines by automatically adjusting the amount of work each parallel thread does, and the amount of memory and register resources it requires. By operating within the MLIR compiler infrastructure, we are able to also target AMD GPUs by performing automatic translation from CUDA and simultaneously adjust the program granularity to fit the size of target GPUs. Combined with autotuning assisted by the platform-specific compiler, our approach demonstrates 27% geomean speedup on the Rodinia benchmark suite over baseline CUDA implementation as well as performance parity between similar NVIDIA and AMD GPUs executing the same CUDA program.\",\"PeriodicalId\":517814,\"journal\":{\"name\":\"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)\",\"volume\":\"53 7\",\"pages\":\"119-132\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-03-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CGO57630.2024.10444828\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CGO57630.2024.10444828","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
摘要
为了接近峰值性能,像 GPU 这样的加速器需要对特定架构进行大量调整,了解共享内存、并行性、张量内核等的可用性。遗憾的是,为了追求更高的性能和更低的成本,即使是同一供应商的架构设计也出现了严重的多样化。这就产生了在不同 GPU 之间实现性能可移植性的需求,尤其是对于采用特定编程模型并考虑到特定架构的程序而言,这一点尤为重要。即使程序可以在不同的架构上无缝执行,也可能会因为没有根据可用的硬件资源(如快速内存和寄存器)适当调整大小而导致性能下降,更不用说没有使用架构的最新高级功能了。我们提出了一种新方法,通过自动调整每个并行线程的工作量及其所需的内存和寄存器资源量,提高(传统)CUDA 程序在现代机器上的性能。通过在 MLIR 编译器基础架构内运行,我们还能从 CUDA 自动转换到 AMD GPU,同时调整程序粒度以适应目标 GPU 的大小。结合特定平台编译器辅助下的自动调整,我们的方法在 Rodinia 基准套件上比基准 CUDA 实现提高了 27% 的 geomean 速度,并且在执行相同 CUDA 程序的类似英伟达和 AMD GPU 之间实现了性能均等。
Retargeting and Respecializing GPU Workloads for Performance Portability
In order to come close to peak performance, accelerators like GPUs require significant architecture-specific tuning that understand the availability of shared memory, parallelism, tensor cores, etc. Unfortunately, the pursuit of higher performance and lower costs have led to a significant diversification of architecture designs, even from the same vendor. This creates the need for performance portability across different GPUs, especially important for programs in a particular programming model with a certain architecture in mind. Even when the program can be seamlessly executed on a different architecture, it may suffer a performance penalty due to it not being sized appropriately to the available hardware resources such as fast memory and registers, let alone not using newer advanced features of the architecture. We propose a new approach to improving performance of (legacy) CUDA programs for modern machines by automatically adjusting the amount of work each parallel thread does, and the amount of memory and register resources it requires. By operating within the MLIR compiler infrastructure, we are able to also target AMD GPUs by performing automatic translation from CUDA and simultaneously adjust the program granularity to fit the size of target GPUs. Combined with autotuning assisted by the platform-specific compiler, our approach demonstrates 27% geomean speedup on the Rodinia benchmark suite over baseline CUDA implementation as well as performance parity between similar NVIDIA and AMD GPUs executing the same CUDA program.