Phase Aware Warp Scheduling: Mitigating Effects of Phase Behavior in GPGPU Applications

2015 International Conference on Parallel Architecture and Compilation (PACT). Pub Date: 2015-10-18. DOI: 10.1109/PACT.2015.31

Mihir Awatramani, Xian Zhu, Joseph Zambreno, D. Rover


Citations: 13

Abstract



Graphics Processing Units (GPUs) have been widely adopted as accelerators for high-performance computing because of the computational throughput they offer over their CPU counterparts. As GPU architectures are optimized for throughput, they execute a large number of SIMD threads (warps) in parallel and use hardware multithreading to hide pipeline and memory access latencies. While the Two-Level Round Robin (TLRR) and Greedy Then Oldest (GTO) warp scheduling policies have been widely accepted in the academic research community, there is no consensus on which policy works best for all applications. In this paper, we show that which scheduling policy works better depends on the characteristics of the instructions in different regions (phases) of the application. We identify these phases at compile time and design a novel warp scheduling policy that uses information about them to make scheduling decisions at runtime. By mitigating the adverse effects of application phase behavior, our policy always performs closer to the better of the two existing policies for each application. We evaluate the performance of the warp schedulers on 35 kernels from the Rodinia and CUDA SDK benchmark suites. For applications that perform better with the GTO scheduler, our warp scheduler matches the performance of GTO with 99.2% accuracy and achieves an average speedup of 6.31% over RR. Similarly, for applications that perform better with RR, the performance of our scheduler is within 98% of RR, and it achieves an average speedup of 6.65% over GTO.
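For readers unfamiliar with the two baseline policies named in the abstract, the following is a minimal sketch of how a warp scheduler might pick the next warp under each one. It is an illustration under stated assumptions, not the authors' phase-aware scheduler and not simulator code: the `Warp` class, the `age` and `ready` fields, and both function names are hypothetical, and the round-robin variant shown is plain (single-level) round robin rather than the two-level scheme the abstract mentions.

```python
from dataclasses import dataclass


@dataclass
class Warp:
    warp_id: int  # assumed equal to the warp's index in the list
    age: int      # cycle at which the warp was launched; smaller means older
    ready: bool   # True if the warp's next instruction can issue this cycle


def pick_round_robin(warps, last_issued):
    """Plain round robin: scan from the warp after last_issued and
    return the first ready warp, wrapping around the list."""
    n = len(warps)
    for offset in range(1, n + 1):
        candidate = warps[(last_issued + offset) % n]
        if candidate.ready:
            return candidate.warp_id
    return None  # no warp can issue this cycle


def pick_gto(warps, last_issued):
    """Greedy Then Oldest: keep issuing from the same warp while it is
    ready; otherwise fall back to the oldest ready warp."""
    if last_issued is not None and warps[last_issued].ready:
        return last_issued
    ready = [w for w in warps if w.ready]
    if not ready:
        return None
    return min(ready, key=lambda w: w.age).warp_id


if __name__ == "__main__":
    warps = [
        Warp(0, age=5, ready=True),
        Warp(1, age=2, ready=False),
        Warp(2, age=1, ready=True),
        Warp(3, age=7, ready=True),
    ]
    print(pick_round_robin(warps, last_issued=0))  # skips stalled warp 1
    print(pick_gto(warps, last_issued=0))          # stays on warp 0
    print(pick_gto(warps, last_issued=1))          # warp 1 stalled, picks oldest ready
```

Round robin spreads issue slots evenly so that warps tend to progress together, whereas the greedy policy lets one warp run ahead until it stalls; this difference in progress is the kind of behavior a phase-sensitive choice between the two would have to weigh.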
