Convergence-aware optimal checkpointing for exploratory deep learning training jobs

IF 6.2 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS
Hongliang Li, Zichen Wang, Hairui Zhao, Meng Zhang, Xiang Li, Haixiao Xu
{"title":"Convergence-aware optimal checkpointing for exploratory deep learning training jobs","authors":"Hongliang Li ,&nbsp;Zichen Wang ,&nbsp;Hairui Zhao ,&nbsp;Meng Zhang ,&nbsp;Xiang Li ,&nbsp;Haixiao Xu","doi":"10.1016/j.future.2024.107597","DOIUrl":null,"url":null,"abstract":"<div><div>Training Deep Learning (DL) models are becoming more time-consuming, thus interruptions to the training processes are inevitable. We can obtain an optimal checkpointing interval to minimize the fault tolerance overhead for a HPC (High Performance Computing) job with the precondition that the job progress is proportional to its execution time. Unfortunately, it is not the case in DL model training, where a DL training job yields diminishing returns across its lifetime. Meanwhile, training DL models is inherently exploratory, with early termination frequently occurring during model training&amp;developing. It makes the early progress of a DL training job more valuable than the later ones. Even placement of checkpoints would either increase the risks in the early stages or waste resources overprotecting the latter stages. Moreover, in data parallelism, the state-of-the-art quality-driven scheduling strategies allocate more resources for the early stages of a job than the later ones to accelerate the training progress, which further amplifies the issue. In summary, the early stage is more important than the later stages. Allocating more fault-tolerant resources to the early stages is beneficial for the model exploration. Based on the aforementioned conclusion, we present COCI, an approach to compute optimal checkpointing configuration for a exploratory DL training job, minimizing the fault tolerance overhead, including checkpoint cost and recovery cost. We implement COCI based on state-of-the-art iteration-level checkpointing mechanism, as a pluggable module compatible with PyTorch without extra user input. The experimental results show that COCI reduces up to 40.18% fault tolerance overhead compared to existing state-of-the-art DL fault tolerance methods in serial scenario, 60.64% in data parallel scenario.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"164 ","pages":"Article 107597"},"PeriodicalIF":6.2000,"publicationDate":"2024-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Future Generation Computer Systems-The International Journal of Escience","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167739X24005612","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
Citations: 0

Abstract

Training Deep Learning (DL) models is becoming increasingly time-consuming, so interruptions to the training process are inevitable. For an HPC (High Performance Computing) job, an optimal checkpointing interval that minimizes the fault tolerance overhead can be derived under the precondition that job progress is proportional to execution time. Unfortunately, this precondition does not hold for DL model training, where a training job yields diminishing returns over its lifetime. Moreover, training DL models is inherently exploratory, and early termination frequently occurs during model training and development, which makes the early progress of a DL training job more valuable than the later progress. Evenly placed checkpoints would therefore either increase the risk in the early stages or waste resources overprotecting the later stages. Furthermore, in data parallelism, state-of-the-art quality-driven scheduling strategies allocate more resources to the early stages of a job than to the later ones in order to accelerate training, which further amplifies the issue. In summary, the early stages are more important than the later stages, and allocating more fault-tolerance resources to them benefits model exploration. Based on this conclusion, we present COCI, an approach that computes the optimal checkpointing configuration for an exploratory DL training job, minimizing the fault tolerance overhead, which comprises checkpoint cost and recovery cost. We implement COCI on top of a state-of-the-art iteration-level checkpointing mechanism, as a pluggable module compatible with PyTorch that requires no extra user input. Experimental results show that, compared with existing state-of-the-art DL fault tolerance methods, COCI reduces the fault tolerance overhead by up to 40.18% in the serial scenario and 60.64% in the data-parallel scenario.
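The abstract does not name the classical interval it alludes to, but the stated precondition (progress proportional to execution time) matches the well-known Young/Daly analysis. The following first-order derivation is a sketch in our own notation (C = checkpoint cost, M = mean time between failures, τ = checkpoint interval), not taken from the paper:

\[
W(\tau) \;\approx\; \underbrace{\frac{C}{\tau}}_{\text{checkpoint overhead}} \;+\; \underbrace{\frac{\tau}{2M}}_{\text{expected rework after a failure}},
\qquad
\frac{\mathrm{d}W}{\mathrm{d}\tau} = -\frac{C}{\tau^{2}} + \frac{1}{2M} = 0
\;\Longrightarrow\;
\tau_{\mathrm{opt}} = \sqrt{2CM}.
\]

COCI's premise is that once training yields diminishing returns and early iterations matter more than later ones, a single uniform interval of this form either under-protects the valuable early stages or over-protects the later ones, which is why a convergence-aware, non-uniform checkpoint configuration can achieve lower overall overhead.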
Source journal: Future Generation Computer Systems
CiteScore: 19.90
Self-citation rate: 2.70%
Annual publications: 376
Review time: 10.6 months
Journal introduction: Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications. Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration. Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.