Poise: Balancing Thread-Level Parallelism and Memory System Performance in GPUs Using Machine Learning

Saumay Dublish, V. Nagarajan, N. Topham
{"title":"Poise: Balancing Thread-Level Parallelism and Memory System Performance in GPUs Using Machine Learning","authors":"Saumay Dublish, V. Nagarajan, N. Topham","doi":"10.1109/HPCA.2019.00061","DOIUrl":null,"url":null,"abstract":"GPUs employ a high degree of thread-level parallelism (TLP) to hide the long latency of memory operations. However, the consequent increase in demand on the memory system causes pathological effects such as cache thrashing and bandwidth bottlenecks. As a result, high degrees of TLP can adversely affect system throughput. In this paper, we present Poise, a novel approach for balancing TLP and memory system performance in GPUs. Poise has two major components: a machine learning framework and a hardware inference engine. The machine learning framework comprises a regression model that is trained offline on a set of profiled kernels to learn best warp scheduling decisions. At runtime, the hardware inference engine uses the previously learned model to dynamically predict best warp scheduling decisions for unseen applications. Therefore, Poise helps in optimizing entirely new applications without posing any profiling, training or programming burden on the end-user. Across a set of benchmarks that were unseen during training, Poise achieves a speedup of up to 2.94× and a harmonic mean speedup of 46.6%, over the baseline greedythen-oldest warp scheduler. Poise is extremely lightweight and incurs a minimal hardware overhead of around 41 bytes per SM. It also reduces the overall energy consumption by an average of 51.6%. Furthermore, Poise outperforms the prior state-ofthe-art warp scheduler by an average of 15.1%. In effect, Poise solves a complex hardware optimization problem with considerable accuracy and efficiency. 
Keywords-warp scheduling; caches; machine learning","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"80 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPCA.2019.00061","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Cited by: 13

Abstract

GPUs employ a high degree of thread-level parallelism (TLP) to hide the long latency of memory operations. However, the consequent increase in demand on the memory system causes pathological effects such as cache thrashing and bandwidth bottlenecks. As a result, high degrees of TLP can adversely affect system throughput. In this paper, we present Poise, a novel approach for balancing TLP and memory system performance in GPUs. Poise has two major components: a machine learning framework and a hardware inference engine. The machine learning framework comprises a regression model that is trained offline on a set of profiled kernels to learn best warp scheduling decisions. At runtime, the hardware inference engine uses the previously learned model to dynamically predict best warp scheduling decisions for unseen applications. Therefore, Poise helps in optimizing entirely new applications without posing any profiling, training or programming burden on the end-user. Across a set of benchmarks that were unseen during training, Poise achieves a speedup of up to 2.94× and a harmonic mean speedup of 46.6% over the baseline greedy-then-oldest warp scheduler. Poise is extremely lightweight and incurs a minimal hardware overhead of around 41 bytes per SM. It also reduces the overall energy consumption by an average of 51.6%. Furthermore, Poise outperforms the prior state-of-the-art warp scheduler by an average of 15.1%. In effect, Poise solves a complex hardware optimization problem with considerable accuracy and efficiency.

Keywords: warp scheduling; caches; machine learning
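The offline-training/runtime-inference split described above can be illustrated with a minimal sketch: fit a regression model on profiled kernels (features describing memory-system pressure, target being the best warp count found by offline search), then use the learned weights to predict a warp limit for an unseen kernel. The feature set, training data, and linear model below are purely illustrative assumptions, not Poise's actual features, data, or model.

```python
import numpy as np

# Hypothetical offline profiling data: one row per training kernel.
# Illustrative features: [L1 miss rate, DRAM bandwidth utilization,
# normalized average memory latency]. Not Poise's actual feature set.
X = np.array([
    [0.10, 0.30, 0.2],
    [0.55, 0.80, 0.7],
    [0.35, 0.60, 0.5],
    [0.80, 0.90, 0.9],
    [0.20, 0.40, 0.3],
])
# Target: best number of active warps found by exhaustive offline search
# (memory-bound kernels favor lower TLP, compute-bound favor higher).
y = np.array([48.0, 16.0, 28.0, 8.0, 40.0])

# Offline training: least-squares fit of a linear model with a bias term.
Xb = np.hstack([X, np.ones((X.shape[0], 1))])
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)

# "Runtime" inference: predict a warp limit for an unseen kernel profile
# (in hardware this would be a cheap dot product with the stored weights).
unseen = np.array([0.45, 0.70, 0.6, 1.0])
predicted_warps = float(unseen @ w)
print(f"predicted warp limit: {predicted_warps:.1f}")
```

In hardware, the inference step reduces to a handful of multiply-accumulates over a small stored weight vector, which is consistent with the paper's claim of an overhead of only tens of bytes per SM.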