Virtual Thread: Maximizing Thread-Level Parallelism beyond GPU Scheduling Limit

M. Yoon, Keunsoo Kim, Sangpil Lee, W. Ro, M. Annavaram
{"title":"Virtual Thread: Maximizing Thread-Level Parallelism beyond GPU Scheduling Limit","authors":"M. Yoon, Keunsoo Kim, Sangpil Lee, W. Ro, M. Annavaram","doi":"10.1145/3007787.3001201","DOIUrl":null,"url":null,"abstract":"Modern GPUs require tens of thousands of concurrent threads to fully utilize the massive amount of processing resources. However, thread concurrency in GPUs can be diminished either due to shortage of thread scheduling structures (scheduling limit), such as available program counters and single instruction multiple thread stacks, or due to shortage of on-chip memory (capacity limit), such as register file and shared memory. Our evaluations show that in practice concurrency in many general purpose applications running on GPUs is curtailed by the scheduling limit rather than the capacity limit. Maximizing the utilization of on-chip memory resources without unduly increasing the scheduling complexity is a key goal of this paper. This paper proposes a Virtual Thread (VT) architecture which assigns Cooperative Thread Arrays (CTAs) up to the capacity limit, while ignoring the scheduling limit. However, to reduce the logic complexity of managing more threads concurrently, we propose to place CTAs into active and inactive states, such that the number of active CTAs still respects the scheduling limit. When all the warps in an active CTA hit a long latency stall, the active CTA is context switched out and the next ready CTA takes its place. We exploit the fact that both active and inactive CTAs still fit within the capacity limit which obviates the need to save and restore large amounts of CTA state. Thus VT significantly reduces performance penalties of CTA swapping. By swapping between active and inactive states, VT can exploit higher degree of thread level parallelism without increasing logic complexity. Our simulation results show that VT improves performance by 23.9% on average.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"86 1","pages":"609-621"},"PeriodicalIF":0.0000,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"50","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3007787.3001201","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 50

Abstract

Modern GPUs require tens of thousands of concurrent threads to fully utilize their massive processing resources. However, thread concurrency in GPUs can be diminished either by a shortage of thread scheduling structures (the scheduling limit), such as available program counters and single-instruction multiple-thread (SIMT) stacks, or by a shortage of on-chip memory (the capacity limit), such as the register file and shared memory. Our evaluations show that, in practice, concurrency in many general-purpose applications running on GPUs is curtailed by the scheduling limit rather than the capacity limit. Maximizing the utilization of on-chip memory resources without unduly increasing scheduling complexity is the key goal of this paper. We propose a Virtual Thread (VT) architecture that assigns Cooperative Thread Arrays (CTAs) up to the capacity limit while ignoring the scheduling limit. To keep the logic complexity of managing more concurrent threads low, CTAs are placed into active and inactive states such that the number of active CTAs still respects the scheduling limit. When all the warps in an active CTA hit a long-latency stall, that CTA is context-switched out and the next ready CTA takes its place. Because both active and inactive CTAs fit within the capacity limit, there is no need to save and restore large amounts of CTA state, so VT significantly reduces the performance penalty of CTA swapping. By swapping CTAs between active and inactive states, VT exploits a higher degree of thread-level parallelism without increasing logic complexity. Our simulation results show that VT improves performance by 23.9% on average.
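The CTA-state policy described in the abstract can be read as a small state machine. Below is a minimal, hypothetical Python sketch of that policy; the limits and names (SCHED_LIMIT, CAPACITY_LIMIT, dispatch, on_all_warps_stalled) are illustrative assumptions, not from the paper, and the real mechanism is hardware logic inside each GPU core, not software.

```python
# Hypothetical sketch of the Virtual Thread (VT) active/inactive CTA policy.
# All names and limits are illustrative; the paper describes a hardware scheme.
from collections import deque

SCHED_LIMIT = 8      # CTAs that can hold scheduling structures (PCs, SIMT stacks)
CAPACITY_LIMIT = 16  # CTAs whose registers/shared memory fit in on-chip storage

class CTA:
    def __init__(self, cta_id):
        self.id = cta_id

active = deque()    # CTAs currently holding scheduling structures
inactive = deque()  # resident CTAs without scheduling structures

def dispatch(cta):
    """Assign a CTA up to the capacity limit, ignoring the scheduling limit."""
    assert len(active) + len(inactive) < CAPACITY_LIMIT
    if len(active) < SCHED_LIMIT:
        active.append(cta)
    else:
        # Registers/shared memory are allocated, but the CTA is not schedulable yet.
        inactive.append(cta)

def on_all_warps_stalled(cta):
    """When every warp of an active CTA hits a long-latency stall, swap it out."""
    if not inactive:
        return  # no ready CTA to swap in; keep waiting
    replacement = inactive.popleft()
    active.remove(cta)
    inactive.append(cta)        # its registers/shared memory stay resident on chip
    active.append(replacement)  # only small scheduling state changes hands

# Usage: CTAs 0-7 become active, 8-11 wait inactive; a stall swaps CTA 0 for CTA 8.
for i in range(12):
    dispatch(CTA(i))
on_all_warps_stalled(active[0])
```

The point the sketch mirrors is that a swap only moves a CTA between the two queues: since both active and inactive CTAs already fit within the capacity limit, register-file and shared-memory contents never move, which is why the context switch is cheap.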