PEP: proactive checkpointing for efficient preemption on GPUs

Chen Li, Andrew Zigerelli, Jun Yang, Yang Guo
{"title":"PEP: proactive checkpointing for efficient preemption on GPUs","authors":"Chen Li, Andrew Zigerelli, Jun Yang, Yang Guo","doi":"10.1109/DAC.2018.8465929","DOIUrl":null,"url":null,"abstract":"The demand for multitasking GPUs increases whenever the GPU may be shared by multiple applications, either spatially or temporally. This requires that GPUs can be preempted and switch context to a new application while already executing one. Unlike CPUs, context switching in GPUs is prohibitively expensive due to the large context states to swap out. There have been a number of efforts on reducing the overhead of preemption, through reducing the context sizes or overlapping context switching with execution. All those techniques are reactive approaches, meaning that context switching occurs when the preemption request arrives.In this paper, we propose a proactive mechanism to reduce the latency of preemption. We observe that kernel execution is almost always preceded by known commands in both CUDA and OpenCL implementations. Hence, a preemption can be anticipated before the actual request arrives. We study such lead time and develop a prediction scheme to perform an early state saving. When the actual preemption is invoked, an incremental update relative to the previous saved state is performed, much like the conventional checkpointing mechanism. This design effectively reduces the stall time of the preempting kernel due to context switching by 58.6%. Moreover, through careful handling of the saved state, we can also reduce the overall size of saved state by an average of 23.3%, compared with a full context switching.","PeriodicalId":87346,"journal":{"name":"Proceedings. Design Automation Conference","volume":"21 1","pages":"114:1-114:6"},"PeriodicalIF":0.0000,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings. Design Automation Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DAC.2018.8465929","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

The demand for multitasking GPUs increases whenever a GPU may be shared by multiple applications, either spatially or temporally. This requires that a GPU can be preempted and switch context to a new application while already executing one. Unlike on CPUs, context switching on GPUs is prohibitively expensive due to the large context state that must be swapped out. There have been a number of efforts to reduce the overhead of preemption, either by reducing the context size or by overlapping context switching with execution. All of these techniques are reactive approaches, meaning that context switching begins only when the preemption request arrives. In this paper, we propose a proactive mechanism to reduce the latency of preemption. We observe that kernel execution is almost always preceded by known commands in both CUDA and OpenCL implementations; hence, a preemption can be anticipated before the actual request arrives. We study this lead time and develop a prediction scheme that performs an early state save. When the actual preemption is invoked, an incremental update relative to the previously saved state is performed, much like a conventional checkpointing mechanism. This design reduces the stall time of the preempting kernel due to context switching by 58.6%. Moreover, through careful handling of the saved state, we also reduce the overall size of the saved state by an average of 23.3% compared with a full context switch.
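The core idea in the abstract — save a full checkpoint early when a preemption is predicted, then write back only the state that changed once the real request arrives — can be illustrated with a short simulation. Below is a minimal host-side Python sketch; the class and method names, the fallback behavior, and the block granularity are all illustrative assumptions, not the paper's actual hardware implementation.

```python
# Sketch of proactive checkpointing with an incremental update.
# All names and the block granularity are illustrative assumptions.

BLOCK = 256  # checkpoint granularity in words (assumed)

class ProactiveCheckpoint:
    def __init__(self, context):
        self.context = context  # live context state (list of words)
        self.saved = None       # last checkpointed copy, if any

    def on_predicted_preemption(self):
        """Prediction fires on the known commands that precede a kernel
        launch: save the full context early, off the critical path."""
        self.saved = list(self.context)

    def on_actual_preemption(self):
        """When the real request arrives, write back only the blocks that
        changed since the proactive save (the incremental update).
        Returns the number of words written, which models the stall."""
        if self.saved is None:                 # prediction missed:
            self.saved = list(self.context)    # fall back to a reactive
            return len(self.context)           # full context save
        dirty = 0
        for i in range(0, len(self.context), BLOCK):
            if self.context[i:i+BLOCK] != self.saved[i:i+BLOCK]:
                self.saved[i:i+BLOCK] = self.context[i:i+BLOCK]
                dirty += BLOCK
        return dirty

# Usage: the stall on the real preemption is proportional to the dirty
# state, far less than a full save when little changed in the lead time.
ctx = [0] * 4096
cp = ProactiveCheckpoint(ctx)
cp.on_predicted_preemption()       # early save, overlapped with execution
ctx[10] = 42                       # kernel runs on, dirtying one block
print(cp.on_actual_preemption())   # -> 256 words, not 4096
```

The sketch shows why both reported savings follow from the same mechanism: the stall at preemption time shrinks because most of the copy happened during the lead time, and the total saved state shrinks because unchanged blocks are never rewritten.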