FastPersist: Accelerating Model Checkpointing in Deep Learning
Guanhua Wang, Olatunji Ruwase, Bing Xie, Yuxiong He
arXiv - CS - Performance, 2024-06-19, doi: arxiv-2406.13768
Abstract
Model checkpoints are critical Deep Learning (DL) artifacts that enable fault tolerance for training and for downstream applications such as inference. However, writing checkpoints to persistent storage, along with other I/O aspects of DL training, is mostly ignored by compute-focused optimization efforts aimed at faster training of rapidly growing models and datasets. To address this imbalance, we propose FastPersist, which accelerates checkpoint creation in DL training. FastPersist combines three novel techniques: (i) NVMe optimizations for faster checkpoint writes to SSDs, (ii) efficient write parallelism using the SSDs available in training environments, and (iii) overlapping checkpointing with independent training computations. Our evaluation on real-world dense and sparse DL models shows that FastPersist creates checkpoints in persistent storage up to 116x faster than the baseline and enables per-iteration checkpointing with negligible overhead.
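To illustrate technique (iii), overlapping checkpointing with independent training computations, the following is a minimal sketch in PyTorch. It is not the FastPersist implementation: the snapshot/persist split, the background thread, and all names (snapshot_to_cpu, persist_async, checkpoint_interval) are assumptions introduced here for illustration. The idea is to copy the model state out of the way quickly, then let the slow file write proceed concurrently with subsequent iterations.

```python
# Illustrative sketch only (not the authors' code): overlap checkpoint I/O
# with training by persisting a CPU snapshot on a background thread.
import threading
import torch

def snapshot_to_cpu(model):
    # Copy parameters to CPU so later GPU updates cannot race with the write.
    return {name: t.detach().to("cpu", copy=True)
            for name, t in model.state_dict().items()}

def persist_async(state_dict, path):
    # Write the snapshot on a background thread, overlapping file I/O with
    # the independent forward/backward computation of later iterations.
    t = threading.Thread(target=torch.save, args=(state_dict, path))
    t.start()
    return t  # join() before the next checkpoint to guarantee durability

# Hypothetical use inside a training loop:
# if step % checkpoint_interval == 0:
#     pending = persist_async(snapshot_to_cpu(model), f"ckpt_{step}.pt")
```

In this sketch only the CPU snapshot blocks the training loop; the persistence itself runs concurrently, which is the property that makes per-iteration checkpointing plausible at low overhead. FastPersist additionally applies NVMe-specific write optimizations and parallelizes writes across the SSDs in the training cluster, which this sketch does not capture.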