Exploiting CXL-based Memory for Distributed Deep Learning

Moiz Arif, Kevin Assogba, M. M. Rafique, Sudharshan S. Vazhkudai
{"title":"Exploiting CXL-based Memory for Distributed Deep Learning","authors":"Moiz Arif, Kevin Assogba, M. M. Rafique, Sudharshan S. Vazhkudai","doi":"10.1145/3545008.3545054","DOIUrl":null,"url":null,"abstract":"Deep learning (DL) is being widely used to solve complex problems in scientific applications from diverse domains, such as weather forecasting, medical diagnostics, and fluid dynamics simulation. DL applications consume a large amount of data using large-scale high-performance computing (HPC) systems to train a given model. These workloads have large memory and storage requirements that typically go beyond the limited amount of main memory available on an HPC server. This significantly increases the overall training time as the input training data and model parameters are frequently swapped to slower storage tiers during the training process. In this paper, we use the latest advancements in the memory subsystem, specifically Compute Express Link (CXL), to provide additional memory and fast scratch space for DL workloads to reduce the overall training time while enabling DL jobs to efficiently train models using data that is much larger than the installed system memory. We propose a framework, called DeepMemoryDL, that manages the allocation of additional CXL-based memory, introduces a fast intermediate storage tier, and provides intelligent prefetching and caching mechanisms for DL workloads. We implement and integrate DeepMemoryDL with a popular DL platform, TensorFlow, to show that our approach reduces read and write latencies, improves the overall I/O throughput, and reduces the training time. Our evaluation shows a performance improvement of up to 34% and 27% compared to the default TensorFlow platform and CXL-based memory expansion approaches, respectively.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"333 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 51st International Conference on Parallel Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3545008.3545054","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

Deep learning (DL) is being widely used to solve complex problems in scientific applications from diverse domains, such as weather forecasting, medical diagnostics, and fluid dynamics simulation. DL applications consume a large amount of data using large-scale high-performance computing (HPC) systems to train a given model. These workloads have large memory and storage requirements that typically go beyond the limited amount of main memory available on an HPC server. This significantly increases the overall training time as the input training data and model parameters are frequently swapped to slower storage tiers during the training process. In this paper, we use the latest advancements in the memory subsystem, specifically Compute Express Link (CXL), to provide additional memory and fast scratch space for DL workloads to reduce the overall training time while enabling DL jobs to efficiently train models using data that is much larger than the installed system memory. We propose a framework, called DeepMemoryDL, that manages the allocation of additional CXL-based memory, introduces a fast intermediate storage tier, and provides intelligent prefetching and caching mechanisms for DL workloads. We implement and integrate DeepMemoryDL with a popular DL platform, TensorFlow, to show that our approach reduces read and write latencies, improves the overall I/O throughput, and reduces the training time. Our evaluation shows a performance improvement of up to 34% and 27% compared to the default TensorFlow platform and CXL-based memory expansion approaches, respectively.
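Although the abstract does not include code, the prefetching and caching mechanisms it describes map naturally onto TensorFlow's tf.data input pipeline. The sketch below is an illustration only, not the DeepMemoryDL implementation: it assumes a hypothetical CXL-backed scratch mount point (CXL_SCRATCH) and uses standard tf.data operators to show how training records cached on a fast scratch tier, combined with batch prefetching, can hide slow-tier I/O during training.

```python
# Illustrative sketch only -- NOT the DeepMemoryDL implementation.
# Assumes a hypothetical filesystem mount backed by CXL-attached memory.
import tensorflow as tf

CXL_SCRATCH = "/mnt/cxl_scratch/train_cache"  # hypothetical fast scratch path

def make_pipeline(file_pattern, batch_size=128):
    files = tf.data.Dataset.list_files(file_pattern, shuffle=True)
    ds = tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)
    # Cache raw records on the fast scratch tier so subsequent epochs read
    # from CXL-backed storage instead of the slower backend store.
    ds = ds.cache(CXL_SCRATCH)
    ds = ds.shuffle(10_000).batch(batch_size)
    # Overlap I/O with computation by prefetching batches ahead of training.
    return ds.prefetch(tf.data.AUTOTUNE)

dataset = make_pipeline("/data/train-*.tfrecord")
```

In this kind of pipeline, the first epoch populates the scratch-tier cache while later epochs and prefetched batches are served from the faster tier; DeepMemoryDL's contribution, per the abstract, is to manage the CXL memory allocation and perform such caching and prefetching intelligently on behalf of the DL job.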