ProTrain: Efficient LLM Training via Memory-Aware Techniques
Hanmei Yang, Jin Zhou, Yao Fu, Xiaoqun Wang, Ramine Roane, Hui Guan, Tongping Liu
arXiv - CS - Performance, published 2024-06-12 (arXiv:2406.08334). Citations: 0
Abstract
Training Large Language Models (LLMs) is extremely memory-hungry. To address this problem, existing work such as ZeRO-Offload combines CPU and GPU memory during training. Such techniques have largely democratized billion-scale model training, making it possible to train with only a few consumer graphics cards. However, we observe that existing frameworks often provide coarse-grained memory management and require expert configuration tuning, leading to suboptimal hardware utilization and performance. This paper proposes ProTrain, a novel training system that intelligently balances memory usage and performance by coordinating memory, computation, and IO. ProTrain achieves adaptive memory management through Chunk-Based Model State Management and Block-Wise Activation Management, guided by a Memory-Aware Runtime Profiler and requiring no user intervention. ProTrain does not change the training algorithm and thus does not compromise accuracy. Experiments show that ProTrain improves training throughput by 1.43$\times$ to 2.71$\times$ over state-of-the-art training systems.
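To make the idea of chunk-based model state management concrete, here is a minimal, hedged sketch in plain Python. It is not ProTrain's actual implementation; the chunk size, class names, and eviction policy are illustrative assumptions. The key point it shows is that parameters are packed into fixed-size chunks, and whole chunks (rather than individual tensors) are moved between a simulated CPU pool and a budget-limited "GPU" pool, keeping transfers coarse-grained and bookkeeping small.

```python
# Illustrative sketch of chunk-based parameter offloading (an assumption of
# the general technique, NOT ProTrain's real code). Parameter tensors are
# flattened into fixed-size chunks; a manager keeps a bounded number of
# chunks "on GPU" and offloads the rest back to a CPU pool.

CHUNK_SIZE = 4  # elements per chunk; tiny on purpose for illustration


def pack_into_chunks(params, chunk_size=CHUNK_SIZE):
    """Flatten a list of parameter tensors into fixed-size chunks."""
    flat = [x for p in params for x in p]
    return [flat[i:i + chunk_size] for i in range(0, len(flat), chunk_size)]


class ChunkManager:
    """Keeps at most `gpu_budget` chunks resident in the 'GPU' pool."""

    def __init__(self, chunks, gpu_budget):
        self.cpu = {i: c for i, c in enumerate(chunks)}  # all start on CPU
        self.gpu = {}
        self.budget = gpu_budget

    def fetch(self, idx):
        """Bring chunk `idx` to the GPU pool, evicting the oldest if full."""
        if idx in self.gpu:
            return self.gpu[idx]
        if len(self.gpu) >= self.budget:
            # Evict the oldest resident chunk: offload it back to CPU memory.
            old, chunk = next(iter(self.gpu.items()))
            del self.gpu[old]
            self.cpu[old] = chunk
        self.gpu[idx] = self.cpu.pop(idx)
        return self.gpu[idx]


# Three "parameter tensors" of uneven sizes, packed into two chunks of 4.
params = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0]]
chunks = pack_into_chunks(params)
mgr = ChunkManager(chunks, gpu_budget=1)
mgr.fetch(0)  # chunk 0 brought onto the GPU
mgr.fetch(1)  # GPU budget exceeded: chunk 0 offloaded, chunk 1 resident
```

Because movement happens at chunk granularity, the runtime can overlap a chunk's CPU-GPU transfer with computation on the previous chunk, which is the kind of memory/compute/IO coordination the abstract describes.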