FreeLauncher: Lossless Failure Recovery of Parameter Servers with Ultralight Replication

Yangyang Zhang, Jianxin Li, Yiming Zhang, Lijie Wang, Ling Liu

2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS), July 2021. DOI: 10.1109/ICDCS51616.2021.00052
Modern distributed machine learning (ML) systems leverage large-scale computing infrastructures to achieve fast model training. When many servers jointly train a model and a training task can be completed in minutes rather than days, failure recovery becomes an important challenge. The state-of-the-art checkpointing mechanism cannot meet the need for efficient recovery in large-scale ML, because its high cost prevents timely checkpointing, and a server failure is likely to cause a substantial loss of intermediate results when the checkpointing interval is comparable to the entire training time. This paper proposes FreeLauncher (FLR), a lossless recovery mechanism for large-scale ML that performs ultralight replication (instead of checkpointing) to guarantee that all intermediate training results (parameters) are replicated in a timely manner. Our key insight is that in the parameter-server (PS) architecture, multiple copies of each intermediate result already exist, not only on the server but also on the workers, and most of them are qualified for failure recovery. FLR addresses the challenges of parameter sparsity (e.g., when training LDA) and staleness (e.g., when adopting relaxed consistency) by selectively replicating the latest copies of sparse/stale parameters so that at least k up-to-date copies exist, which allows FLR to tolerate any k-1 failures by re-launching the failed servers with parameters recovered from workers. We implement FLR on TensorFlow. Evaluation results show that FLR achieves lossless failure recovery (requiring almost no recomputation) at little cost.
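To make the replication idea from the abstract concrete, below is a minimal, illustrative Python sketch; it is not the authors' TensorFlow implementation, and all names (Copy, Worker, ParameterServer, ensure_k_copies, recover) are hypothetical. It assumes single scalar parameters, a single server, and synchronous pushes, and only shows the two core steps: topping up stale/sparse parameters to k up-to-date copies, and rebuilding a re-launched server from the freshest copies held by workers.

```python
# Sketch of "ultralight replication": keep at least k up-to-date copies of every
# parameter across the server and its workers, so any k-1 failures can be
# recovered by re-launching the failed server with copies gathered from workers.
# Hypothetical, simplified model -- not the paper's actual implementation.

from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Copy:
    value: float   # parameter value held by a node
    version: int   # training iteration at which this copy was produced


@dataclass
class Worker:
    copies: Dict[str, Copy] = field(default_factory=dict)  # param name -> local copy


class ParameterServer:
    def __init__(self, workers: List[Worker], k: int):
        self.params: Dict[str, Copy] = {}  # authoritative copies on the server
        self.workers = workers
        self.k = k                         # minimum number of up-to-date copies

    def update(self, name: str, value: float, version: int) -> None:
        self.params[name] = Copy(value, version)

    def ensure_k_copies(self, name: str) -> None:
        """Selectively replicate a parameter so that at least k nodes hold its
        latest version (covers sparse/stale parameters that some workers never
        touched or only hold in an old version)."""
        latest = self.params[name]

        def up_to_date(w: Worker) -> bool:
            c = w.copies.get(name)
            return c is not None and c.version == latest.version

        # One copy on the server plus every worker already holding the latest version.
        have = 1 + sum(up_to_date(w) for w in self.workers)
        for w in self.workers:
            if have >= self.k:
                break
            if not up_to_date(w):
                w.copies[name] = Copy(latest.value, latest.version)  # push latest copy
                have += 1

    def recover(self) -> Dict[str, Copy]:
        """Rebuild parameters from the freshest copies that surviving workers
        hold; a re-launched server would restore its state this way."""
        recovered: Dict[str, Copy] = {}
        for w in self.workers:
            for name, c in w.copies.items():
                if name not in recovered or c.version > recovered[name].version:
                    recovered[name] = c
        return recovered


# Example: 3 workers, k = 2, so any single server failure is recoverable.
workers = [Worker() for _ in range(3)]
ps = ParameterServer(workers, k=2)
ps.update("w1", 0.42, version=7)
ps.ensure_k_copies("w1")                  # at least one worker now holds version 7
assert ps.recover()["w1"].version == 7    # latest value survives the server's loss
```

The k-1 fault tolerance follows directly from the invariant: with at least k up-to-date copies spread across distinct nodes, losing any k-1 of them still leaves one latest copy from which the failed server can be re-launched without recomputation.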