TBA: Faster Large Language Model Training Using SSD-Based Activation Offloading

arXiv - CS - Neural and Evolutionary Computing | Pub Date: 2024-08-19 | DOI: arxiv-2408.10013

Kun Wu, Jeongmin Brian Park, Xiaofan Zhang, Mert Hidayetoğlu, Vikram Sharma Mailthody, Sitao Huang, Steven Sam Lumetta, Wen-mei Hwu

Citations: 0

Abstract

GPU memory capacity has not grown as fast as the size of large language models (LLMs), which hinders model training. In particular, activations (the intermediate tensors produced during forward propagation and reused in backward propagation) dominate GPU memory use. To address this challenge, we propose TBA to efficiently offload activations to high-capacity NVMe SSDs. This approach reduces GPU memory usage without impacting performance by adaptively overlapping data transfers with computation. TBA is compatible with popular deep learning frameworks like PyTorch, Megatron, and DeepSpeed, and it employs techniques such as tensor deduplication, forwarding, and adaptive offloading to further enhance efficiency. We conduct extensive experiments on GPT, BERT, and T5. Results demonstrate that TBA reduces activation peak memory usage by 47%. At the same time, TBA perfectly overlaps the I/O with the computation and incurs negligible performance overhead. We introduce the recompute-offload-keep (ROK) curve to compare TBA offloading with two other tensor placement strategies: keeping activations in memory and layerwise full recomputation. We find that TBA achieves better memory savings than layerwise full recomputation while retaining the performance of keeping the activations in memory.
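The general idea of offloading activations while computation continues can be illustrated with a small, self-contained sketch. This is not TBA's implementation: the names `ActivationOffloader` and `train_step` are hypothetical, files in a temporary directory stand in for an NVMe SSD, a single background thread stands in for asynchronous I/O, and each "layer" is a scalar multiplication rather than a tensor operation.

```python
import os
import pickle
import tempfile
from concurrent.futures import ThreadPoolExecutor


class ActivationOffloader:
    """Toy store that spills saved activations to files on a background thread."""

    def __init__(self, directory):
        self.directory = directory
        self.pool = ThreadPoolExecutor(max_workers=1)  # one I/O worker
        self.pending = {}  # layer index -> Future of the write for that layer

    def _write(self, layer, activation):
        path = os.path.join(self.directory, f"act_{layer}.pkl")
        with open(path, "wb") as f:
            pickle.dump(activation, f)
        return path

    def offload(self, layer, activation):
        # Returns immediately, so the write can overlap with later compute.
        self.pending[layer] = self.pool.submit(self._write, layer, activation)

    def reload(self, layer):
        # Wait for the write only if it is still in flight, then read it back.
        path = self.pending.pop(layer).result()
        with open(path, "rb") as f:
            activation = pickle.load(f)
        os.remove(path)
        return activation

    def close(self):
        self.pool.shutdown(wait=True)


def train_step(weights, x, offloader):
    """Forward and backward pass through output = w_n * ... * w_1 * x."""
    # Forward: the input of each layer is the activation its gradient needs.
    for layer, w in enumerate(weights):
        offloader.offload(layer, x)  # hand the activation to the I/O thread
        x = w * x                    # compute proceeds while the write runs
    output = x

    # Backward: reload the saved activations in reverse layer order.
    grad_out = 1.0  # d(output)/d(output)
    grads = [0.0] * len(weights)
    for layer in reversed(range(len(weights))):
        saved_input = offloader.reload(layer)
        grads[layer] = grad_out * saved_input  # d(output)/d(w_layer)
        grad_out = grad_out * weights[layer]   # gradient w.r.t. the layer input
    return output, grads


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        offloader = ActivationOffloader(directory)
        print(train_step([2.0, 3.0, 4.0], 1.5, offloader))
        offloader.close()
```

In this sketch the forward pass never waits for a write to finish, and the backward pass blocks only if an activation it needs is still being written. The deduplication, forwarding, and adaptive offloading techniques named in the abstract are not modeled here.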