Optimizations in GPU: Smart compilers and core-level reconfiguration

Deming Chen
{"title":"GPU优化:智能编译器和核心级重新配置","authors":"Deming Chen","doi":"10.1109/SLIP.2013.6681686","DOIUrl":null,"url":null,"abstract":"Summary form only given. Graphics processing units (GPUs) are increasingly critical for general-purpose parallel processing performance. GPU hardware is composed of many streaming multiprocessors, allowing GPUs to execute tens of thousands of threads in parallel. However, due to the SIMD (single-instruction multiple-data) execution style, resource utilization and thus overall performance can be significantly affected if computation threads must take diverging control paths. Meanwhile, tuning GPU applications' performance is also a complex and labor intensive task. Software programmers employ a variety of optimization techniques to explore tradeoffs between the thread parallelism and performance of a single thread. New GPU architecture also allows concurrent kernel executions which introduces interesting kernel scheduling problems. In the first part of the talk, we will mainly introduce our recent studies on control flow optimization, joint optimization of register allocation and thread structure, and concurrent kernel scheduling, for GPU performance improvements. Energy efficiency of GPUs for general-purpose computing is increasingly important as well. The integration of GPUs onto SoCs for use in mobile devices in the last 5 years has further exacerbated the need to reduce the energy foot print of GPUs. In the second part of the talk, we propose a novel GPU architecture that makes use of reconfiguration to exploit ILP and DVFS (Dynamic Voltage and Frequency Scaling) techniques to reduce the power consumption, without sacrificing the computational throughput. We expect that applications with large amounts of ILP should see dramatic improvements in their energy and power, when compared to nominal CUDA-based architectures. In addition to this, we foresee interesting challenges with respect to scheduling of threads and the re-organization of CUDA warp structures and schedules. We also note that dynamic reconfiguration of cores within a SIMD unit (SM in CUDA), affects the number of threads that can execute concurrently and thus would change the number of effective warps in flight, which may affect the capability to overlap execution time and memory latency.","PeriodicalId":385305,"journal":{"name":"2013 ACM/IEEE International Workshop on System Level Interconnect Prediction (SLIP)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Optimizations in GPU: Smart compilers and core-level reconfiguration\",\"authors\":\"Deming Chen\",\"doi\":\"10.1109/SLIP.2013.6681686\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Summary form only given. Graphics processing units (GPUs) are increasingly critical for general-purpose parallel processing performance. GPU hardware is composed of many streaming multiprocessors, allowing GPUs to execute tens of thousands of threads in parallel. However, due to the SIMD (single-instruction multiple-data) execution style, resource utilization and thus overall performance can be significantly affected if computation threads must take diverging control paths. Meanwhile, tuning GPU applications' performance is also a complex and labor intensive task. 
Software programmers employ a variety of optimization techniques to explore tradeoffs between the thread parallelism and performance of a single thread. New GPU architecture also allows concurrent kernel executions which introduces interesting kernel scheduling problems. In the first part of the talk, we will mainly introduce our recent studies on control flow optimization, joint optimization of register allocation and thread structure, and concurrent kernel scheduling, for GPU performance improvements. Energy efficiency of GPUs for general-purpose computing is increasingly important as well. The integration of GPUs onto SoCs for use in mobile devices in the last 5 years has further exacerbated the need to reduce the energy foot print of GPUs. In the second part of the talk, we propose a novel GPU architecture that makes use of reconfiguration to exploit ILP and DVFS (Dynamic Voltage and Frequency Scaling) techniques to reduce the power consumption, without sacrificing the computational throughput. We expect that applications with large amounts of ILP should see dramatic improvements in their energy and power, when compared to nominal CUDA-based architectures. In addition to this, we foresee interesting challenges with respect to scheduling of threads and the re-organization of CUDA warp structures and schedules. We also note that dynamic reconfiguration of cores within a SIMD unit (SM in CUDA), affects the number of threads that can execute concurrently and thus would change the number of effective warps in flight, which may affect the capability to overlap execution time and memory latency.\",\"PeriodicalId\":385305,\"journal\":{\"name\":\"2013 ACM/IEEE International Workshop on System Level Interconnect Prediction (SLIP)\",\"volume\":\"40 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-06-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2013 ACM/IEEE International Workshop on System Level Interconnect Prediction (SLIP)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SLIP.2013.6681686\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 ACM/IEEE International Workshop on System Level Interconnect Prediction (SLIP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SLIP.2013.6681686","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

Summary form only given. Graphics processing units (GPUs) are increasingly critical for general-purpose parallel processing performance. GPU hardware is composed of many streaming multiprocessors, allowing GPUs to execute tens of thousands of threads in parallel. However, due to the SIMD (single-instruction, multiple-data) execution style, resource utilization, and thus overall performance, can suffer significantly if computation threads must take diverging control paths. Meanwhile, tuning GPU application performance is a complex and labor-intensive task: programmers employ a variety of optimization techniques to explore the tradeoff between thread-level parallelism and single-thread performance. Newer GPU architectures also allow concurrent kernel execution, which introduces interesting kernel scheduling problems. In the first part of the talk, we introduce our recent studies on control flow optimization, joint optimization of register allocation and thread structure, and concurrent kernel scheduling, all aimed at improving GPU performance.

Energy efficiency of GPUs for general-purpose computing is increasingly important as well. The integration of GPUs onto SoCs for mobile devices over the last five years has further sharpened the need to reduce the energy footprint of GPUs. In the second part of the talk, we propose a novel GPU architecture that uses reconfiguration to exploit instruction-level parallelism (ILP) together with DVFS (dynamic voltage and frequency scaling) to reduce power consumption without sacrificing computational throughput. We expect applications with large amounts of ILP to see dramatic improvements in energy and power compared to nominal CUDA-based architectures. We also foresee interesting challenges in thread scheduling and in the reorganization of CUDA warp structures and schedules: dynamically reconfiguring the cores within a SIMD unit (an SM in CUDA) changes the number of threads that can execute concurrently, and hence the number of effective warps in flight, which may affect the ability to overlap execution time and memory latency.
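
To make the divergence point concrete, consider a minimal CUDA sketch (an illustration added here, not code from the talk; the kernel names are ours). In the first kernel, even and odd lanes of every 32-thread warp branch differently, so the SIMD hardware runs both paths serially with half the lanes masked off; the second kernel arranges the same work so that each branch is taken by whole warps.

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel with warp divergence: even and odd lanes of every
// 32-thread warp take different branches, so the hardware executes both
// paths back to back, masking off the inactive lanes each time.
__global__ void divergent(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0)
        out[i] = in[i] * 2.0f;   // half the lanes idle here...
    else
        out[i] = in[i] + 1.0f;   // ...and the other half idle here
}

// The same work arranged so each branch is taken by whole warps: the first
// half of the threads handle the even elements, the second half the odd
// ones. Apart from the one warp straddling the midpoint, no warp diverges.
__global__ void convergent(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int half = n / 2;
    if (i < half)
        out[2 * i] = in[2 * i] * 2.0f;
    else if (i < n)
        out[2 * (i - half) + 1] = in[2 * (i - half) + 1] + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    divergent<<<(n + 255) / 256, 256>>>(out, in, n);
    cudaDeviceSynchronize();
    convergent<<<(n + 255) / 256, 256>>>(out, in, n);
    cudaDeviceSynchronize();
    printf("out[0] = %.1f, out[1] = %.1f\n", out[0], out[1]);
    cudaFree(in);
    cudaFree(out);
    return 0;
}

Note the tradeoff the rewrite exposes: the convergent version replaces branch divergence with stride-2 memory accesses, which coalesce less well, so a real control-flow optimizer has to weigh both effects.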
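One concrete knob behind the "register allocation versus thread structure" tradeoff is the per-thread register budget, which determines how many threads an SM can keep resident. The sketch below uses stock CUDA mechanisms to illustrate the idea; it is not the joint-optimization technique from the talk, and the kernel is a stand-in example.

// Capping registers per thread raises occupancy (more resident warps) but
// can force spills to slow local memory -- the thread-parallelism versus
// single-thread-performance tradeoff described above.

// __launch_bounds__(256, 4) asks the compiler to allocate registers so
// that blocks of up to 256 threads can run with at least 4 blocks
// resident per SM.
__global__ void __launch_bounds__(256, 4)
saxpy(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// A global per-thread register cap can also be set at compile time, and
// the resulting allocation inspected:
//   nvcc --maxrregcount=32 --ptxas-options=-v saxpy.cu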
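Concurrent kernel execution, and the scheduling questions it raises, can be reproduced with ordinary CUDA streams. The minimal sketch below (our illustration; the scheduling policy studied in the talk is not shown) launches two small kernels that may overlap on the device; how the hardware and driver interleave them is precisely the concurrent-kernel-scheduling problem the abstract refers to.

#include <cstdio>
#include <cuda_runtime.h>

// A compute-bound toy kernel; small grids leave SMs idle, so two such
// kernels can genuinely overlap.
__global__ void busy(float *buf, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = buf[i];
    for (int k = 0; k < iters; ++k) v = v * 1.000001f + 0.5f;
    buf[i] = v;
}

int main() {
    const int n = 1 << 16;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // On Fermi-class and later GPUs, kernels launched in different streams
    // may run concurrently; the device decides how to interleave them.
    busy<<<(n + 255) / 256, 256, 0, s0>>>(a, n, 10000);
    busy<<<(n + 255) / 256, 256, 0, s1>>>(b, n, 10000);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(a);
    cudaFree(b);
    return 0;
}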
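Finally, the closing observation about warps in flight can be made concrete with back-of-the-envelope arithmetic. The host-side sketch below assumes a hypothetical SM with 1536 resident threads and a warp size of 32; all numbers are illustrative only, not parameters of the proposed architecture.

#include <cstdio>

// Hypothetical illustration: if reconfiguration fuses SIMD lanes to
// extract ILP within a thread, fewer threads execute at once, so fewer
// warps are in flight to hide memory latency.
int main() {
    const int resident_threads = 1536; // assumed max resident threads/SM
    const int warp_size        = 32;

    for (int lanes_per_thread = 1; lanes_per_thread <= 4; lanes_per_thread *= 2) {
        // Fusing k lanes per thread divides the number of concurrently
        // executing threads, and hence effective warps in flight, by k.
        int effective_threads = resident_threads / lanes_per_thread;
        int warps_in_flight   = effective_threads / warp_size;
        printf("lanes/thread = %d -> effective warps in flight = %d\n",
               lanes_per_thread, warps_in_flight);
    }
    return 0;
}

With four lanes fused per thread, the same SM holds a quarter as many effective warps, leaving less concurrency to hide memory latency, which is exactly the overlap concern the abstract raises.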