GPU memory usage optimization for backward propagation in deep network training

IF 3.4 3区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

Journal of Parallel and Distributed Computing Pub Date : 2025-02-11 DOI:10.1016/j.jpdc.2025.105053

Ding-Yong Hong , Tzu-Hsien Tsai , Ning Wang , Pangfeng Liu , Jan-Jan Wu

{"title":"GPU memory usage optimization for backward propagation in deep network training","authors":"Ding-Yong Hong , Tzu-Hsien Tsai , Ning Wang , Pangfeng Liu , Jan-Jan Wu","doi":"10.1016/j.jpdc.2025.105053","DOIUrl":null,"url":null,"abstract":"<div><div>In modern Deep Learning, it has been a trend to design larger Deep Neural Networks (DNNs) for the execution of more complex tasks and better accuracy. On the other hand, Convolutional Neural Networks (CNNs) have become the standard method for most of computer vision tasks. However, the memory allocation for the intermediate data in convolution layers can cause severe memory pressure during model training. Many solutions have been proposed to resolve the problem. Besides hardware-dependent solutions, a general methodology <em>rematerialization</em> can reduce GPU memory usage by trading computation for memory efficiently. The idea is to select a set of intermediate results during the forward phase as <em>checkpoints</em>, and only save them in memory to reduce memory usage. The backward phase recomputes the intermediate data from the closest checkpoints in memory as needed. This recomputation increases execution time but saves memory by not storing all intermediate results in memory during the forward phase. In this paper, we will focus on efficiently finding the optimal checkpoint subset to achieve the least peak memory usage during the model training. We first describe the theoretical background of the training of a neural network using mathematical equations. We use these equations to identify all essential data required during both forward and backward phases to compute the gradient of weights of the model. We first identify the <em>checkpoint selection</em> problem and propose a dynamic programming algorithm with time complexity <span><math><mi>O</mi><mo>(</mo><msup><mrow><mi>n</mi></mrow><mrow><mn>3</mn></mrow></msup><mo>)</mo></math></span> to solve the problem of finding the optimal checkpoint subset. With extensive experiments, we formulate a more accurate description of the problem using our theoretical analysis and revise the objective function based on the tracing, and propose an <span><math><mi>O</mi><mo>(</mo><mi>n</mi><mo>)</mo></math></span>-time algorithm for finding the optimal checkpoint subset.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"199 ","pages":"Article 105053"},"PeriodicalIF":3.4000,"publicationDate":"2025-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Parallel and Distributed Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0743731525000206","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}

引用次数: 0

Abstract

In modern Deep Learning, it has been a trend to design larger Deep Neural Networks (DNNs) for the execution of more complex tasks and better accuracy. On the other hand, Convolutional Neural Networks (CNNs) have become the standard method for most of computer vision tasks. However, the memory allocation for the intermediate data in convolution layers can cause severe memory pressure during model training. Many solutions have been proposed to resolve the problem. Besides hardware-dependent solutions, a general methodology rematerialization can reduce GPU memory usage by trading computation for memory efficiently. The idea is to select a set of intermediate results during the forward phase as checkpoints, and only save them in memory to reduce memory usage. The backward phase recomputes the intermediate data from the closest checkpoints in memory as needed. This recomputation increases execution time but saves memory by not storing all intermediate results in memory during the forward phase. In this paper, we will focus on efficiently finding the optimal checkpoint subset to achieve the least peak memory usage during the model training. We first describe the theoretical background of the training of a neural network using mathematical equations. We use these equations to identify all essential data required during both forward and backward phases to compute the gradient of weights of the model. We first identify the checkpoint selection problem and propose a dynamic programming algorithm with time complexity

O (n^{3})

to solve the problem of finding the optimal checkpoint subset. With extensive experiments, we formulate a more accurate description of the problem using our theoretical analysis and revise the objective function based on the tracing, and propose an

O (n)

-time algorithm for finding the optimal checkpoint subset.

查看原文本刊更多论文

深度网络训练中向后传播的GPU内存使用优化

在现代深度学习中，设计更大的深度神经网络（dnn）来执行更复杂的任务和更高的准确性已经成为一种趋势。另一方面，卷积神经网络（cnn）已经成为大多数计算机视觉任务的标准方法。然而，卷积层中间数据的内存分配在模型训练过程中会造成严重的内存压力。已经提出了许多解决这个问题的办法。除了依赖于硬件的解决方案外，通用的方法重物化可以通过有效地将计算转换为内存来减少GPU内存的使用。其思想是在转发阶段选择一组中间结果作为检查点，并仅将它们保存在内存中以减少内存使用。向后阶段根据需要从内存中最近的检查点重新计算中间数据。这种重新计算增加了执行时间，但由于在转发阶段没有将所有中间结果存储在内存中，因此节省了内存。在本文中，我们将专注于在模型训练过程中有效地找到最优检查点子集，以实现最小的峰值内存使用。我们首先用数学方程描述神经网络训练的理论背景。我们使用这些方程来确定前向和后向阶段所需的所有基本数据，以计算模型的权重梯度。我们首先识别了检查点选择问题，提出了一种时间复杂度为0 （n3）的动态规划算法来解决寻找最优检查点子集的问题。通过大量的实验，我们利用我们的理论分析和基于跟踪的目标函数来制定更准确的问题描述，并提出了一个O(n)时间算法来寻找最优检查点子集。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Journal of Parallel and Distributed Computing 工程技术-计算机：理论方法

CiteScore

10.30

自引率

2.60%

发文量

172

审稿时长

12 months

期刊介绍： This international journal is directed to researchers, engineers, educators, managers, programmers, and users of computers who have particular interests in parallel processing and/or distributed computing. The Journal of Parallel and Distributed Computing publishes original research papers and timely review articles on the theory, design, evaluation, and use of parallel and/or distributed computing systems. The journal also features special issues on these topics; again covering the full range from the design to the use of our targeted systems.