Optimal Re-Materialization Strategies for Heterogeneous Chains: How to Train Deep Neural Networks with Limited Memory

IF 2.7, Zone 1 (Mathematics), JCR Q2: COMPUTER SCIENCE, SOFTWARE ENGINEERING

ACM Transactions on Mathematical Software, Pub Date: 2024-03-05, DOI: 10.1145/3648633

Olivier Beaumont, Lionel Eyraud-Dubois, Julien Herrmann, Alexis Joly, Alena Shilova


Citations: 0

Abstract

Training feed-forward deep neural networks is a memory-intensive operation that is usually performed on GPUs with limited memory capacity. This may force data scientists to limit the depth of the models or the resolution of the input data when the data does not fit in GPU memory. The re-materialization technique, whose idea comes from the checkpointing strategies developed in the Automatic Differentiation literature, allows data scientists to limit the memory required to store intermediate data (activations), at the cost of additional computation.

This paper introduces a new strategy for re-materializing activations that significantly reduces memory usage. It consists of selecting which activations are saved and which are deleted during the forward phase, and then recomputing the deleted activations when they are needed during the backward phase.
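The save/delete/recompute idea described above can be illustrated with a minimal sketch. It is not the paper's strategy: the layers are hypothetical scalar functions, and the names `LAYERS`, `grad_store_all`, and `grad_rematerialize` are invented for this example. It only shows that keeping a subset of layer inputs and recomputing the rest during the backward phase yields the same gradient at the price of extra forward steps.

```python
import math

# Hypothetical chain: each "layer" is a pair (f, df), a scalar function and its derivative.
LAYERS = [
    (math.sin, math.cos),
    (lambda x: x * x, lambda x: 2.0 * x),
    (math.tanh, lambda x: 1.0 - math.tanh(x) ** 2),
    (lambda x: 3.0 * x + 1.0, lambda x: 3.0),
]


def grad_store_all(x0):
    """Baseline: keep every layer input during the forward phase."""
    inputs = []
    x = x0
    for f, _ in LAYERS:
        inputs.append(x)
        x = f(x)
    g = 1.0
    for (_, df), a in zip(reversed(LAYERS), reversed(inputs)):
        g *= df(a)  # chain rule, last layer first
    return g, len(inputs)


def grad_rematerialize(x0, keep):
    """Keep only the inputs of the layers listed in `keep` (layer 0 is always kept)
    and recompute the deleted inputs from the nearest saved one when needed."""
    saved = {}
    x = x0
    for i, (f, _) in enumerate(LAYERS):
        if i == 0 or i in keep:
            saved[i] = x  # this activation is saved; all others are dropped
        x = f(x)
    g = 1.0
    recomputed = 0
    for i in reversed(range(len(LAYERS))):
        j = max(k for k in saved if k <= i)  # nearest saved input at or before layer i
        a = saved[j]
        for k in range(j, i):  # re-run forward steps j..i-1 to rebuild the input of layer i
            a = LAYERS[k][0](a)
            recomputed += 1
        g *= LAYERS[i][1](a)
    return g, len(saved), recomputed


if __name__ == "__main__":
    print(grad_store_all(0.7))
    print(grad_rematerialize(0.7, keep={2}))
    print(grad_rematerialize(0.7, keep=set()))
```

In this toy chain, saving fewer inputs lowers the number of stored activations and raises the count of recomputed forward steps, which is the trade-off the abstract refers to.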

We propose an original computation model that combines two types of activation savings: either only storing the layer inputs, or recording the complete history of operations that produced the outputs. This paper focuses on the fully heterogeneous case, where the computation time and the memory requirement differ from layer to layer. We prove that finding the optimal solution is NP-hard and that classical techniques from the Automatic Differentiation literature do not apply. Moreover, the classical assumption of memory persistence of materialized activations, used to simplify the search for optimal solutions, no longer holds. We therefore propose a weak memory persistence property and provide a dynamic program to compute the optimal sequence of computations.
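To give a feel for the shape of such a dynamic program, the sketch below solves a deliberately simplified problem, not the one treated in the paper. It assumes the classical persistent-checkpoint model with per-layer forward and backward times but unit-size activations, a single saving mode, and a transient working buffer that is not counted against the budget. The function name `min_backprop_time` and its parameters are hypothetical. The paper's program additionally handles layer-specific memory sizes, the two saving modes, and weak memory persistence.

```python
from functools import lru_cache


def min_backprop_time(uf, ub, slots):
    """Minimum time to back-propagate through a chain under a slot budget.

    uf[i], ub[i]: forward and backward time of layer i (heterogeneous).
    slots: number of activations that can be stored, counting the chain input.
    Simplifying assumptions: every activation occupies one slot, the backward
    step of layer i only needs the input of layer i, and a stored checkpoint
    stays in memory until the segment to its right has been processed.
    """
    n = len(uf)
    prefix = [0.0]
    for t in uf:
        prefix.append(prefix[-1] + t)

    def fwd(s, k):
        # Time to run the forward steps of layers s..k-1.
        return prefix[k] - prefix[s]

    @lru_cache(maxsize=None)
    def cost(s, t, m):
        # Layers s..t, with the input of layer s stored and m slots available
        # in total (including the slot that holds this input).
        if s == t:
            return ub[s]
        # Option 1: no further checkpoint, restart from layer s before each backward step.
        best = sum(fwd(s, j) + ub[j] for j in range(s, t + 1))
        # Option 2: store the input of layer k, handle the right part with one
        # slot fewer, then reuse the freed slot for the left part.
        if m >= 2:
            for k in range(s + 1, t + 1):
                best = min(best, fwd(s, k) + cost(k, t, m - 1) + cost(s, k - 1, m))
        return best

    return cost(0, n - 1, slots)


if __name__ == "__main__":
    uf = [1.0, 1.0, 1.0, 1.0]
    ub = [1.0, 1.0, 1.0, 1.0]
    for slots in (1, 2, 4):
        print(slots, min_backprop_time(uf, ub, slots))
```

The memoized state is the triple (first layer, last layer, available slots), so the table has O(n² · slots) entries and each entry scans O(n) split points. Moving to per-layer activation sizes would replace the slot count with a discretized memory amount.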

This algorithm is made available through the Rotor software, a PyTorch plug-in that handles any network consisting of a sequence of layers, each of which may have an arbitrarily complex structure. Through extensive experiments, we show that our implementation consistently outperforms existing re-materialization approaches for a large class of networks, image sizes, and batch sizes.


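Source journal

ACM Transactions on Mathematical Software (Engineering and Technology: Computer Science, Software Engineering)

CiteScore: 5.00

Self-citation rate: 3.70%

Articles published:

Review time: >12 weeks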

About the journal: As a scientific journal, ACM Transactions on Mathematical Software (TOMS) documents the theoretical underpinnings of numeric, symbolic, algebraic, and geometric computing applications. It focuses on analysis and construction of algorithms and programs, and the interaction of programs and architecture. Algorithms documented in TOMS are available as the Collected Algorithms of the ACM at calgo.acm.org.