Informed Prefetching in I/O Bounded Distributed Deep Learning

2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). Pub Date: 2021-06-01. DOI: 10.1109/IPDPSW52791.2021.00127

X. Ruan, Haiquan Chen


Citations: 2

Abstract

Deep learning research has grown rapidly over the past decade, driven by significant performance improvements on GPUs. Although the computing capability of current GPUs is tremendous, data pre-processing and loading can become a bottleneck that adds substantial training latency and overhead in both CPU and memory, especially when datasets are too large to fit in memory. When datasets are striped across distributed file systems, access to a remote storage node may degrade I/O performance significantly because of network I/O latency in the cloud. Moreover, some deep learning workloads may be assigned to remote GPU servers in edge computing, which results in even higher network I/O latency. It is therefore desirable to provide an efficient parallel and distributed prefetching solution that reduces the I/O cost of data pre-processing before the data is fed into GPUs for training on distributed storage systems in the cloud or at the edge. Although current deep learning frameworks such as PyTorch and TensorFlow offer multiprocessing data loading, their approaches come at the price of high compute and memory usage. In this paper, we present a novel thread-level Informed Prefetching Data Loader framework, IPDL, which supports efficient data prefetching from remote storage nodes in distributed deep learning environments and possibly in edge computing. Compared with its counterpart in PyTorch, IPDL provides faster I/O for data loading while consuming less compute and memory. Extensive experiments on both an individual server and a cluster computing system show the superiority of IPDL over the latest implementation of PyTorch.
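
The abstract does not describe IPDL's internals, so the following is only a generic sketch of thread-level prefetching, not the authors' implementation. It assumes the access order for an epoch is known in advance (the "informed" part) and that reads are I/O-bound, so threads in a single process can overlap them without the per-process memory cost of multiprocessing loaders. The names prefetching_loader, load_fn, num_threads and depth are hypothetical and chosen for illustration.

```python
import time
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def prefetching_loader(load_fn, keys, num_threads=4, depth=8):
    """Yield load_fn(key) for each key, in the given order.

    Up to `depth` reads are kept in flight on a pool of `num_threads`
    threads, so the consumer rarely waits on a single slow read.
    Hypothetical helper for illustration; not part of PyTorch or IPDL.
    """
    key_iter = iter(keys)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Prime the window: submit the first `depth` reads up front.
        pending = deque(pool.submit(load_fn, k) for k in islice(key_iter, depth))
        while pending:
            # Block only on the oldest request, which preserves order.
            sample = pending.popleft().result()
            # Refill the window with the next known key, if any remain.
            for k in islice(key_iter, 1):
                pending.append(pool.submit(load_fn, k))
            yield sample


def fake_remote_read(index):
    """Stand-in for a read from a remote storage node (assumed latency)."""
    time.sleep(0.01)
    return index * 2


if __name__ == "__main__":
    # The access order would normally come from the sampler for the epoch.
    for sample in prefetching_loader(fake_remote_read, range(16)):
        pass  # hand the sample to pre-processing / the training step here
```

The bounded window (depth) caps how many samples sit in memory at once, which is the usual trade-off between hiding latency and limiting memory use in this kind of design.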

