Towards Fast Crash-Consistent Cluster Checkpointing

2022 IEEE High Performance Extreme Computing Conference (HPEC). Pub Date: 2022-09-19. DOI: 10.1109/HPEC55821.2022.9926330

Andrew Wood, Moshik Hershcovitch, Ilias Ennmouri, Weiyu Zong, Saurav Chennuri, S. Cohen, S. Sundararaman, Daniel Waddington, Peter Chin


Citations: 0

Abstract


Towards Fast Crash-Consistent Cluster Checkpointing

Machine learning models are expensive to train: they require costly high-compute hardware and long training times. Models are therefore especially sensitive to program faults and unexpected system crashes, which can erase hours, if not days, of work. Although many strategies exist to mitigate the risk of unexpected system downtime, the most popular in machine learning is checkpointing: periodically saving the state of the model to persistent storage. Checkpointing is effective, but it requires carefully balancing two factors: how often a checkpoint is made (the checkpointing schedule) and the cost of creating each checkpoint. In this paper, we leverage Python Memory Manager (PyMM), which provides Python support for Persistent Memory, along with emerging Persistent Memory technology (Optane DC) to accelerate the checkpointing operation while maintaining crash consistency. We first show that, when checkpointing models, PyMM with persistent memory can save from minutes to days of checkpointing runtime. We then further optimize the checkpointing operation with PyMM and demonstrate our approach with the KMeans and Gaussian Mixture Model (GMM) algorithms on two real-world datasets, MNIST and MusicNet. Our evaluation shows a checkpointing speedup of between 10x and 75x for KMeans and over 3x for GMM against current state-of-the-art checkpointing approaches. We also verify that our solution recovers from crashes, while traditional approaches cannot.
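To make the idea of periodic, crash-consistent checkpointing concrete, the following is a minimal file-based sketch. It is not the PyMM approach described in the abstract and uses no PyMM API; it assumes an ordinary filesystem and the common write-to-temporary-file-then-rename pattern. The names `save_checkpoint`, `load_checkpoint`, `train`, and `checkpoint_every` are hypothetical, chosen only for illustration, and the "training step" is a stand-in counter.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write `state` so that `path` always holds a complete checkpoint.

    Assumption: a generic temp-file-plus-rename scheme, not the paper's method.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the bytes to storage
        # Atomically swap the new file in; readers see old or new, never a mix.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, default):
    """Return the last complete checkpoint, or `default` if none exists."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def train(path, total_steps, checkpoint_every):
    """Toy loop: resume from `path`, then checkpoint every few steps."""
    state = load_checkpoint(path, {"step": 0, "value": 0.0})
    while state["step"] < total_steps:
        state["value"] += 1.0  # stand-in for one real training iteration
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    return state
```

The `checkpoint_every` parameter is the schedule side of the trade-off the abstract mentions: a smaller value loses less work after a crash but pays the serialization and `fsync` cost more often. Full durability on some filesystems would also require syncing the containing directory, which this sketch omits.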

