{"title":"Checkflow: Low-Overhead Checkpointing for Deep Learning Training","authors":"Hangyu Liu;Shouxi Luo;Ke Li;Huanlai Xing;Bo Peng","doi":"10.1109/LCA.2025.3596616","DOIUrl":null,"url":null,"abstract":"During the time-consuming training of deep neural network (DNN) models, the worker has to periodically create checkpoints for tensors like the model parameters and optimizer state to support fast failover. However, due to the high overhead of checkpointing, existing schemes generally create checkpoints at a very low frequency, making recovery inefficient since the unsaved training progress would get lost. In this paper, we propose Checkflow, a low-overhead checkpointing scheme, which enables per-iteration checkpointing for DNN training with minimal or even zero cost of training slowdown. The power of Checkflow stems from the design of <inline-formula><tex-math>$i)$</tex-math></inline-formula> decoupling a tensor’s checkpoint operation into snapshot-then-offload, and <inline-formula><tex-math>$ii)$</tex-math></inline-formula> scheduling these operations appropriately, following the results of the math models. Our experimental results imply that, when the GPU-CPU connection has sufficient bandwidth for the training workload, Checkflow can theoretically overlap all the checkpoint operations for each round of training with the training computation, with trivial or no overhead in peak GPU memory occupancy.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"281-284"},"PeriodicalIF":1.4000,"publicationDate":"2025-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Computer Architecture Letters","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11119290/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Abstract
During the time-consuming training of deep neural network (DNN) models, the worker has to periodically checkpoint tensors such as the model parameters and optimizer state to support fast failover. However, because checkpointing is expensive, existing schemes generally create checkpoints at a very low frequency, making recovery inefficient since unsaved training progress is lost. In this paper, we propose Checkflow, a low-overhead checkpointing scheme that enables per-iteration checkpointing for DNN training with minimal or even zero training slowdown. The power of Checkflow stems from i) decoupling each tensor's checkpoint operation into a snapshot followed by an offload, and ii) scheduling these operations appropriately, guided by analytical models. Our experimental results imply that, when the GPU-CPU connection provides sufficient bandwidth for the training workload, Checkflow can theoretically overlap all checkpoint operations of each training iteration with the training computation, with trivial or no overhead in peak GPU memory occupancy.
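To make the snapshot-then-offload idea concrete, the sketch below shows one possible PyTorch-style realization, assuming a single-GPU training loop and a dedicated CUDA stream for host copies. The names `CheckpointBuffer`, `checkpoint_step`, and `offload_stream` are illustrative assumptions, not part of Checkflow's actual implementation; the point is only that the fast device-to-device snapshot is separated from the slower device-to-host offload, which can then overlap with the next iteration's computation.

```python
# A minimal sketch of snapshot-then-offload per-iteration checkpointing.
# Hypothetical names; this is not the authors' implementation of Checkflow.
import torch


class CheckpointBuffer:
    """Holds a GPU-side snapshot and a pinned CPU buffer for one tensor."""

    def __init__(self, tensor: torch.Tensor):
        self.gpu_snapshot = torch.empty_like(tensor)  # device-side staging copy
        self.cpu_copy = torch.empty_like(tensor, device="cpu").pin_memory()

    def snapshot(self, tensor: torch.Tensor):
        # Step 1: fast device-to-device copy ("snapshot"), issued on the default
        # stream so it is ordered after the update that produced `tensor`.
        self.gpu_snapshot.copy_(tensor)

    def offload(self, stream: torch.cuda.Stream):
        # Step 2: asynchronous device-to-host copy ("offload") on a side stream,
        # overlapping with the next iteration's forward/backward computation.
        with torch.cuda.stream(stream):
            self.cpu_copy.copy_(self.gpu_snapshot, non_blocking=True)


offload_stream = torch.cuda.Stream()
buffers = {}  # tensor name -> CheckpointBuffer


def checkpoint_step(named_tensors):
    """Call once per iteration, after the optimizer step."""
    for name, t in named_tensors:
        if name not in buffers:
            buffers[name] = CheckpointBuffer(t)
        buffers[name].snapshot(t)
    # Let the offload stream wait until all snapshots are complete,
    # then issue the host copies asynchronously.
    offload_stream.wait_stream(torch.cuda.current_stream())
    for buf in buffers.values():
        buf.offload(offload_stream)
```

In this sketch, calling `checkpoint_step(model.named_parameters())` at the end of each iteration would capture a consistent snapshot before the next update, while the pinned-memory host copies drain in the background; how Checkflow actually schedules and budgets these operations is governed by the paper's analytical models.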
Journal Introduction:
IEEE Computer Architecture Letters is a rigorously peer-reviewed forum for publishing early, high-impact results in the areas of uni- and multiprocessor computer systems, computer architecture, microarchitecture, workload characterization, performance evaluation and simulation techniques, and power-aware computing. Submissions are welcomed on any topic in computer architecture, especially but not limited to: microprocessor and multiprocessor systems, microarchitecture and ILP processors, workload characterization, performance evaluation and simulation techniques, compiler-hardware and operating system-hardware interactions, interconnect architectures, memory and cache systems, power and thermal issues at the architecture level, I/O architectures and techniques, independent validation of previously published results, analysis of unsuccessful techniques, domain-specific processor architectures (e.g., embedded, graphics, network, etc.), real-time and high-availability architectures, reconfigurable systems.