{"title":"Less Memory Means smaller GPUs: Backpropagation with Compressed Activations","authors":"Daniel Barley, Holger Fröning","doi":"arxiv-2409.11902","DOIUrl":null,"url":null,"abstract":"The ever-growing scale of deep neural networks (DNNs) has lead to an equally\nrapid growth in computational resource requirements. Many recent architectures,\nmost prominently Large Language Models, have to be trained using supercomputers\nwith thousands of accelerators, such as GPUs or TPUs. Next to the vast number\nof floating point operations the memory footprint of DNNs is also exploding. In\ncontrast, GPU architectures are notoriously short on memory. Even comparatively\nsmall architectures like some EfficientNet variants cannot be trained on a\nsingle consumer-grade GPU at reasonable mini-batch sizes. During training,\nintermediate input activations have to be stored until backpropagation for\ngradient calculation. These make up the vast majority of the memory footprint.\nIn this work we therefore consider compressing activation maps for the backward\npass using pooling, which can reduce both the memory footprint and amount of\ndata movement. The forward computation remains uncompressed. We empirically\nshow convergence and study effects on feature detection at the example of the\ncommon vision architecture ResNet. With this approach we are able to reduce the\npeak memory consumption by 29% at the cost of a longer training schedule, while\nmaintaining prediction accuracy compared to an uncompressed baseline.","PeriodicalId":501301,"journal":{"name":"arXiv - CS - Machine Learning","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Machine Learning","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11902","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
The ever-growing scale of deep neural networks (DNNs) has led to an equally rapid growth in computational resource requirements. Many recent architectures, most prominently Large Language Models, have to be trained on supercomputers with thousands of accelerators such as GPUs or TPUs. In addition to the vast number of floating-point operations, the memory footprint of DNNs is also exploding. GPU architectures, in contrast, are notoriously short on memory. Even comparatively small architectures, such as some EfficientNet variants, cannot be trained on a single consumer-grade GPU at reasonable mini-batch sizes. During training, intermediate input activations have to be stored until backpropagation for gradient calculation; they make up the vast majority of the memory footprint. In this work we therefore consider compressing activation maps for the backward pass using pooling, which can reduce both the memory footprint and the amount of data movement. The forward computation remains uncompressed. We empirically show convergence and study the effects on feature detection using the common vision architecture ResNet as an example. With this approach we are able to reduce peak memory consumption by 29% at the cost of a longer training schedule, while maintaining prediction accuracy compared to an uncompressed baseline.
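The abstract does not spell out the implementation, but the described scheme (forward pass on full-resolution activations, backward pass on pooled copies) maps naturally onto a custom autograd function. The PyTorch sketch below is illustrative only: the class name PooledConv2dFn, the fixed 2x2 average pooling, the nearest-neighbour upsampling, and the padding choice are assumptions for the example, not the authors' code.

```python
import torch
import torch.nn.functional as F
from torch.nn.grad import conv2d_input, conv2d_weight


class PooledConv2dFn(torch.autograd.Function):
    """Hypothetical sketch: convolution whose forward pass uses the
    full-resolution input, but which stores only an average-pooled copy
    of that input for the backward pass."""

    POOL = 2  # assumed compression factor along H and W

    @staticmethod
    def forward(ctx, x, weight):
        # Forward computation remains uncompressed.
        out = F.conv2d(x, weight, padding=1)
        # Save a pooled (compressed) copy of the activation instead of x.
        x_small = F.avg_pool2d(x, kernel_size=PooledConv2dFn.POOL)
        ctx.save_for_backward(x_small, weight)
        ctx.x_shape = x.shape
        return out

    @staticmethod
    def backward(ctx, grad_out):
        x_small, weight = ctx.saved_tensors
        # The input gradient needs only the weights and grad_out, so it is exact.
        grad_x = conv2d_input(ctx.x_shape, weight, grad_out, padding=1)
        # The weight gradient needs the stored activation; upsample the pooled
        # copy back to the original spatial size (this is where the
        # approximation enters).
        x_approx = F.interpolate(x_small, size=ctx.x_shape[2:], mode="nearest")
        grad_w = conv2d_weight(x_approx, weight.shape, grad_out, padding=1)
        return grad_x, grad_w


# Usage: drop-in replacement for F.conv2d(x, w, padding=1) during training.
x = torch.randn(8, 64, 32, 32, requires_grad=True)
w = torch.randn(64, 64, 3, 3, requires_grad=True)
y = PooledConv2dFn.apply(x, w)
y.sum().backward()
```

Under these assumptions only the saved activation is compressed, so peak memory during training drops while the forward outputs and the input gradients stay exact; the weight gradients become approximate, which is consistent with the longer training schedule reported in the abstract.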