FastPersist: Accelerating Model Checkpointing in Deep Learning

Guanhua Wang, Olatunji Ruwase, Bing Xie, Yuxiong He
{"title":"FastPersist: Accelerating Model Checkpointing in Deep Learning","authors":"Guanhua Wang, Olatunji Ruwase, Bing Xie, Yuxiong He","doi":"arxiv-2406.13768","DOIUrl":null,"url":null,"abstract":"Model checkpoints are critical Deep Learning (DL) artifacts that enable fault\ntolerance for training and downstream applications, such as inference. However,\nwriting checkpoints to persistent storage, and other I/O aspects of DL\ntraining, are mostly ignored by compute-focused optimization efforts for faster\ntraining of rapidly growing models and datasets. Towards addressing this\nimbalance, we propose FastPersist to accelerate checkpoint creation in DL\ntraining. FastPersist combines three novel techniques: (i) NVMe optimizations\nfor faster checkpoint writes to SSDs, (ii) efficient write parallelism using\nthe available SSDs in training environments, and (iii) overlapping\ncheckpointing with independent training computations. Our evaluation using real\nworld dense and sparse DL models shows that FastPersist creates checkpoints in\npersistent storage up to 116x faster than baseline, and enables per-iteration\ncheckpointing with negligible overhead.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"62 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Performance","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2406.13768","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Model checkpoints are critical Deep Learning (DL) artifacts that enable fault tolerance for training and downstream applications, such as inference. However, writing checkpoints to persistent storage, and other I/O aspects of DL training, are mostly ignored by compute-focused optimization efforts for faster training of rapidly growing models and datasets. Towards addressing this imbalance, we propose FastPersist to accelerate checkpoint creation in DL training. FastPersist combines three novel techniques: (i) NVMe optimizations for faster checkpoint writes to SSDs, (ii) efficient write parallelism using the available SSDs in training environments, and (iii) overlapping checkpointing with independent training computations. Our evaluation using real-world dense and sparse DL models shows that FastPersist creates checkpoints in persistent storage up to 116x faster than baseline, and enables per-iteration checkpointing with negligible overhead.
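The third technique, overlapping checkpointing with independent training computation, lends itself to a short illustration. Below is a minimal sketch, assuming PyTorch; the helpers `checkpoint_async` and `_write_checkpoint` are hypothetical names for this example, not FastPersist's actual API. The idea: snapshot the weights to host memory first, then hand the slow persistent write to a background thread so it overlaps with the next iteration's forward and backward passes.

```python
# Minimal sketch (not the FastPersist implementation): overlap a
# checkpoint write with the next training step. All names here are
# hypothetical; assumes PyTorch is installed.
import io
import os
import threading
import torch

def _write_checkpoint(payload: bytes, path: str) -> None:
    # Persist the serialized checkpoint and force it to stable storage.
    # (FastPersist's NVMe optimizations, e.g. unbuffered/direct I/O,
    # would replace this plain buffered write.)
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())

def checkpoint_async(model: torch.nn.Module, path: str) -> threading.Thread:
    # Snapshot parameters to host memory so the training loop is free to
    # mutate GPU weights immediately; only the snapshot, not the live
    # model, is handed to the writer thread.
    cpu_state = {k: v.detach().cpu() for k, v in model.state_dict().items()}
    buf = io.BytesIO()
    torch.save(cpu_state, buf)
    writer = threading.Thread(
        target=_write_checkpoint, args=(buf.getvalue(), path)
    )
    writer.start()
    return writer

# Usage in a training loop: the persistent write overlaps with the
# compute of the next iteration; join before the next checkpoint so
# writes to the same file never race.
#
#   writer = checkpoint_async(model, "ckpt_step_0042.pt")
#   loss = train_step(model, batch)  # runs while the write proceeds
#   writer.join()
```

Writing from a snapshot rather than the live state dict is what makes the overlap safe: once the copy to host memory completes, the remaining I/O has no dependence on the training computation.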