PRM: An Efficient Partial Recovery Method to Accelerate Training Data Reconstruction for Distributed Deep Learning Applications in Cloud Storage Systems

2022 IEEE/ACM 30th International Symposium on Quality of Service (IWQoS). Pub Date: 2022-06-10. DOI: 10.1109/IWQoS54832.2022.9812919

Piao Hu, Yunfei Gu, Ranhao Jia, Chentao Wu, Minyi Guo, Jie Li



Abstract

Distributed deep learning is a common machine learning approach that runs in distributed environments such as cloud computing systems. The corresponding training, validation, and test datasets are generally very large (e.g., several TBs) and must be stored across multiple data nodes. Because disk failure rates in cloud storage systems are high, a critical issue for distributed deep learning is how to tolerate disk failures efficiently during training. These failures can cause substantial data loss, which decreases training accuracy and slows down the training process. Although several recovery methods have been proposed to accelerate data reconstruction, their overhead is extremely high, including high CPU/GPU utilization and a large number of I/Os. To address these problems, we propose a novel Partial-Recovery Method (PRM), an adaptive recovery method that accelerates data reconstruction for distributed deep learning applications in cloud storage systems. The key idea of PRM is to combine erasure coding's ability to obtain global information on the data distribution with AI's ability to recover part of the lost data, which can sharply reduce the overhead while keeping training accuracy acceptable. To demonstrate the effectiveness of PRM, we conduct several experiments. The results show that, compared to state-of-the-art full or approximate recovery methods, PRM decreases the average network transmission time overhead by up to 64.50% and reduces the recovery time by up to 55.90%.
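The abstract does not describe PRM's algorithm, so the following is only a minimal sketch of the general idea of partial recovery on top of an erasure code, not the authors' method. It assumes a single-parity (XOR) code per stripe, one failed disk, and a hypothetical per-stripe `importance` score that stands in for whatever criterion decides which lost blocks are worth rebuilding. The names `encode_stripe`, `recover_block`, `partial_recover`, and the `fraction` parameter are all illustrative.

```python
import math
from functools import reduce
from typing import Dict, List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode_stripe(data_blocks: List[bytes]) -> bytes:
    """Compute the parity block of one stripe (single-parity XOR code)."""
    return xor_blocks(data_blocks)


def recover_block(surviving: List[bytes], parity: bytes) -> bytes:
    """Rebuild the one missing data block of a stripe from the rest."""
    return xor_blocks(surviving + [parity])


def partial_recover(
    stripes: List[List[bytes]],
    parities: List[bytes],
    failed_disk: int,
    importance: List[float],  # hypothetical score per stripe (assumption)
    fraction: float,          # share of lost blocks to rebuild (assumption)
) -> Dict[int, bytes]:
    """Rebuild only the highest-scored lost blocks of the failed disk."""
    n_rebuild = math.ceil(fraction * len(stripes))
    # Rank stripes by importance and keep the top n_rebuild.
    chosen = sorted(range(len(stripes)),
                    key=lambda s: importance[s], reverse=True)[:n_rebuild]
    recovered = {}
    for s in chosen:
        # The block on the failed disk is treated as unreadable.
        surviving = [b for d, b in enumerate(stripes[s]) if d != failed_disk]
        recovered[s] = recover_block(surviving, parities[s])
    return recovered
```

In this sketch, a full recovery corresponds to `fraction=1.0`; lowering `fraction` reads and transfers fewer surviving blocks, which is the kind of overhead the abstract says partial recovery reduces, at the cost of leaving some training data unrecovered.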


