Automated Architecture-Aware Mapping of Streaming Applications Onto GPUs

A. Hagiescu, Huynh Phung Huynh, W. Wong, R. Goh
{"title":"流式应用程序到gpu的自动架构感知映射","authors":"A. Hagiescu, Huynh Phung Huynh, W. Wong, R. Goh","doi":"10.1109/IPDPS.2011.52","DOIUrl":null,"url":null,"abstract":"Graphic Processing Units (GPUs) are made up of many streaming multiprocessors, each consisting of processing cores that interleave the execution of a large number of threads. Groups of threads - called {\\em warps} and {\\em wave fronts}, respectively, in nVidia and AMD literature - are selected by the hardware scheduler and executed in lockstep on the available cores. If threads in such a group access the slow off-chip global memory, the entire group has to be stalled, and another group is scheduled instead. The utilization of a given multiprocessor will remain high if there is a sufficient number of alternative thread groups to select from. Many parallel general purpose applications have been efficiently mapped to GPUs. Unfortunately, many stream processing applications exhibit unfavorable data movement patterns and low computation-to-communication ratio that may lead to poor performance. In this paper, we describe an automated compilation flow that maps most stream processing applications onto GPUs by taking into consideration two important architectural features of nVidia GPUs, namely interleaved execution as well as the small amount of shared memory available in each streaming multiprocessors. In particular, we show that using a small number of compute threads such that the memory footprint is reduced, we can achieve high utilization of the GPU cores. Our scheme goes against the conventional wisdom of GPU programming which is to use a large number of homogeneous threads. Instead, it uses a mix of {\\em compute} and {\\em memory access} threads, together with a carefully crafted schedule that exploits parallelism in the streaming application, while maximizing the effectiveness of the unique memory hierarchy. \\% small on-chip memory located within each streaming multiprocessor. We have implemented our scheme in the compiler of the Stream It programming language, and our results show a significant speedup compared to the state-of-the-art solutions.","PeriodicalId":355100,"journal":{"name":"2011 IEEE International Parallel & Distributed Processing Symposium","volume":"23 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"30","resultStr":"{\"title\":\"Automated Architecture-Aware Mapping of Streaming Applications Onto GPUs\",\"authors\":\"A. Hagiescu, Huynh Phung Huynh, W. Wong, R. Goh\",\"doi\":\"10.1109/IPDPS.2011.52\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Graphic Processing Units (GPUs) are made up of many streaming multiprocessors, each consisting of processing cores that interleave the execution of a large number of threads. Groups of threads - called {\\\\em warps} and {\\\\em wave fronts}, respectively, in nVidia and AMD literature - are selected by the hardware scheduler and executed in lockstep on the available cores. If threads in such a group access the slow off-chip global memory, the entire group has to be stalled, and another group is scheduled instead. The utilization of a given multiprocessor will remain high if there is a sufficient number of alternative thread groups to select from. Many parallel general purpose applications have been efficiently mapped to GPUs. 
Unfortunately, many stream processing applications exhibit unfavorable data movement patterns and low computation-to-communication ratio that may lead to poor performance. In this paper, we describe an automated compilation flow that maps most stream processing applications onto GPUs by taking into consideration two important architectural features of nVidia GPUs, namely interleaved execution as well as the small amount of shared memory available in each streaming multiprocessors. In particular, we show that using a small number of compute threads such that the memory footprint is reduced, we can achieve high utilization of the GPU cores. Our scheme goes against the conventional wisdom of GPU programming which is to use a large number of homogeneous threads. Instead, it uses a mix of {\\\\em compute} and {\\\\em memory access} threads, together with a carefully crafted schedule that exploits parallelism in the streaming application, while maximizing the effectiveness of the unique memory hierarchy. \\\\% small on-chip memory located within each streaming multiprocessor. We have implemented our scheme in the compiler of the Stream It programming language, and our results show a significant speedup compared to the state-of-the-art solutions.\",\"PeriodicalId\":355100,\"journal\":{\"name\":\"2011 IEEE International Parallel & Distributed Processing Symposium\",\"volume\":\"23 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2011-05-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"30\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2011 IEEE International Parallel & Distributed Processing Symposium\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IPDPS.2011.52\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 IEEE International Parallel & Distributed Processing Symposium","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS.2011.52","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 30

Abstract

Graphics Processing Units (GPUs) are made up of many streaming multiprocessors, each consisting of processing cores that interleave the execution of a large number of threads. Groups of threads, called warps in nVidia literature and wave fronts in AMD literature, are selected by the hardware scheduler and executed in lockstep on the available cores. If threads in such a group access the slow off-chip global memory, the entire group is stalled and another group is scheduled instead. The utilization of a given multiprocessor will remain high if there is a sufficient number of alternative thread groups to select from. Many general-purpose parallel applications have been efficiently mapped to GPUs. Unfortunately, many stream processing applications exhibit unfavorable data movement patterns and a low computation-to-communication ratio, which may lead to poor performance. In this paper, we describe an automated compilation flow that maps most stream processing applications onto GPUs by taking into consideration two important architectural features of nVidia GPUs, namely interleaved execution and the small amount of shared memory available in each streaming multiprocessor. In particular, we show that by using a small number of compute threads, so that the memory footprint is reduced, we can achieve high utilization of the GPU cores. Our scheme goes against the conventional wisdom of GPU programming, which is to use a large number of homogeneous threads. Instead, it uses a mix of compute and memory access threads, together with a carefully crafted schedule that exploits the parallelism of the streaming application while maximizing the effectiveness of the GPU's unique memory hierarchy: the small on-chip memory located within each streaming multiprocessor. We have implemented our scheme in the compiler of the StreamIt programming language, and our results show a significant speedup compared to state-of-the-art solutions.
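The compute/memory thread split described in the abstract can be pictured with a minimal CUDA sketch. This is an illustration, not the StreamIt compiler's generated code: the kernel name, the TILE size, the fixed one-warp/one-warp split, and the per-element work are all assumptions made for the example. One warp per block streams tiles from global memory into a double-buffered shared-memory staging area, while a second warp computes on the tile staged in the previous iteration, so memory latency overlaps with computation even though only a few threads compute.

```cuda
// Minimal sketch of a compute/memory thread split with double buffering.
// Warp 0 stages data; warp 1 computes on the previously staged tile.
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 256  // elements staged per iteration (illustrative)

__global__ void stream_stage(const float *in, float *out, int n) {
    __shared__ float buf[2][TILE];  // two buffers: load one, compute on the other
    int warp = threadIdx.x / 32;
    int lane = threadIdx.x % 32;
    int numTiles = (n + TILE - 1) / TILE;

    // One extra iteration drains the pipeline: the last tile is staged at
    // t = numTiles - 1 and consumed at t = numTiles.
    for (int t = 0; t <= numTiles; ++t) {
        if (warp == 0 && t < numTiles) {
            // Memory-access warp: stage tile t into buf[t % 2].
            for (int i = lane; i < TILE; i += 32) {
                int idx = t * TILE + i;
                buf[t & 1][i] = (idx < n) ? in[idx] : 0.0f;
            }
        }
        if (warp == 1 && t > 0) {
            // Compute warp: process tile t-1 from buf[(t-1) % 2].
            for (int i = lane; i < TILE; i += 32) {
                int idx = (t - 1) * TILE + i;
                float x = buf[(t - 1) & 1][i];
                if (idx < n) out[idx] = x * x + 1.0f;  // stand-in per-element work
            }
        }
        // Barrier: tile t must be fully staged before iteration t+1 reads it,
        // and buf[t % 2] is not overwritten (at t+2) before it has been read.
        __syncthreads();
    }
}

int main() {
    const int n = 1000;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    // One block, two warps: 32 memory-access threads + 32 compute threads.
    stream_stage<<<1, 64>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("out[10] = %f (expect 101.0)\n", h_out[10]);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

In the paper's actual flow, the ratio of compute to memory-access threads and the schedule are derived automatically by the compiler from the stream program; the fixed 32/32 split and single block above are only to keep the sketch self-contained.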