Compiler-Directed Incremental Checkpointing for Low Latency GPU Preemption

Zhuoran Ji, Cho-Li Wang
{"title":"Compiler-Directed Incremental Checkpointing for Low Latency GPU Preemption","authors":"Zhuoran Ji, Cho-Li Wang","doi":"10.1109/ipdps53621.2022.00078","DOIUrl":null,"url":null,"abstract":"GPUs are widely used in data centers to accelerate data-parallel applications. The multiuser and multitasking environment provides a strong incentive for preemptive GPU multitasking, especially for latency-sensitive jobs. Due to the large contexts of GPU kernels, preemptive GPU context switching is costly. Many novel GPU preemption techniques are proposed. Among them, checkpoint-based GPU preemption enables low latency GPU preemption but incurs a high runtime overhead. Prior studies propose to exclude dead registers from the checkpoint file to reduce the runtime overhead. It works well for CPUs, but it is not rare that a live register is not updated between two checkpoints for GPU kernels. This paper presents TripleC, a compiler-directed incremental checkpointing technique specially designed for GPU preemption. It further excludes the registers, which have not been overwritten since the last time they were spilled, from the checkpoint file with data flow analysis. The checkpoint placement algorithm of TripleC can properly estimate a checkpoint's cost under incremental checkpointing. It also considers the interaction among checkpoints so that the overall cost is minimized. Moreover, TripleC relaxes the conventional checkpointing constraint that the whole register context must be spilled before passing the checkpoint. Because of the diverse control flow, placing a register spilling instruction at different points incurs different costs. TripleC minimizes the cost with a two-phase algorithm that schedules these register spilling instructions at compilation time. Evaluations show that TripleC reduces the runtime overhead by 12.9 % on average compared with the state-of-the-art non-incremental checkpointing approach.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"362 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ipdps53621.2022.00078","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1

Abstract

GPUs are widely used in data centers to accelerate data-parallel applications. The multiuser, multitasking environment provides a strong incentive for preemptive GPU multitasking, especially for latency-sensitive jobs. Because GPU kernels have large contexts, preemptive GPU context switching is costly, and many novel GPU preemption techniques have been proposed. Among them, checkpoint-based GPU preemption enables low-latency preemption but incurs a high runtime overhead. Prior studies propose excluding dead registers from the checkpoint file to reduce this overhead. This works well for CPUs, but in GPU kernels it is common for a live register to remain unmodified between two checkpoints. This paper presents TripleC, a compiler-directed incremental checkpointing technique designed specifically for GPU preemption. Using data-flow analysis, TripleC further excludes from the checkpoint file those registers that have not been overwritten since they were last spilled. Its checkpoint placement algorithm properly estimates a checkpoint's cost under incremental checkpointing, and it accounts for the interaction among checkpoints so that the overall cost is minimized. Moreover, TripleC relaxes the conventional checkpointing constraint that the whole register context must be spilled before a checkpoint is passed: because of diverse control flow, placing a register-spilling instruction at different program points incurs different costs, and TripleC minimizes this cost with a two-phase algorithm that schedules these spilling instructions at compilation time. Evaluations show that TripleC reduces the runtime overhead by 12.9% on average compared with the state-of-the-art non-incremental checkpointing approach.
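To make the core idea concrete, below is a minimal sketch, in Python, of the incremental spill analysis the abstract describes: at each checkpoint, only registers that are both live and "dirty" (overwritten since they were last spilled) need to be written out. This is not TripleC's actual implementation; the instruction representation, the register names, and the assumption that per-instruction liveness sets are precomputed by a standard backward analysis are all illustrative.

```python
# Sketch of incremental checkpoint spill-set computation (illustrative only,
# not the paper's implementation). At each checkpoint, spill exactly the
# registers that are live AND dirty (overwritten since their last spill);
# clean live registers already have up-to-date copies in memory.

from dataclasses import dataclass, field

@dataclass
class Instr:
    writes: set = field(default_factory=set)   # registers this instruction defines
    reads: set = field(default_factory=set)    # registers this instruction uses
    is_checkpoint: bool = False                # checkpoint marker inserted by the compiler

def incremental_spill_sets(instrs, live_after):
    """For each checkpoint index, return the registers that must be spilled.

    live_after[i] is the set of registers live immediately after instruction i,
    assumed here to come from a standard backward liveness analysis.
    """
    dirty = set()       # registers overwritten since they were last spilled
    spill_plan = {}
    for i, ins in enumerate(instrs):
        if ins.is_checkpoint:
            to_spill = dirty & live_after[i]   # live + dirty -> must be spilled
            spill_plan[i] = to_spill
            dirty -= to_spill                  # spilled copies are now up to date
        dirty |= ins.writes                    # new definitions make registers dirty
    return spill_plan

# Straight-line example: r1 is live across both checkpoints but is not
# overwritten after the first one, so the second checkpoint excludes it.
kernel = [
    Instr(writes={"r1"}),                  # r1 defined
    Instr(writes={"r2"}, reads={"r1"}),    # r2 = f(r1)
    Instr(is_checkpoint=True),             # checkpoint A: spill {r1, r2}
    Instr(writes={"r2"}, reads={"r2"}),    # r2 updated; r1 untouched
    Instr(is_checkpoint=True),             # checkpoint B: spill only {r2}
]
live = [{"r1"}, {"r1", "r2"}, {"r1", "r2"}, {"r1", "r2"}, {"r1", "r2"}]
print(incremental_spill_sets(kernel, live))
# -> {2: {'r1', 'r2'}, 4: {'r2'}}   (set display order may vary)
```

The example shows why the dead-register optimization alone is insufficient on GPUs: r1 is live at checkpoint B, so a non-incremental scheme would spill it again, whereas tracking dirtiness since the last spill excludes it. The paper's full technique additionally handles diverse control flow, checkpoint cost estimation, and compile-time scheduling of the spill instructions themselves, which this straight-line sketch does not model.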