Hanmei Yang, Jin Zhou, Yao Fu, Xiaoqun Wang, Ramine Roane, Hui Guan, Tongping Liu
{"title":"ProTrain:通过记忆感知技术进行高效 LLM 训练","authors":"Hanmei Yang, Jin Zhou, Yao Fu, Xiaoqun Wang, Ramine Roane, Hui Guan, Tongping Liu","doi":"arxiv-2406.08334","DOIUrl":null,"url":null,"abstract":"It is extremely memory-hungry to train Large Language Models (LLM). To solve\nthis problem, existing work exploits the combination of CPU and GPU for the\ntraining process, such as ZeRO-Offload. Such a technique largely democratizes\nbillion-scale model training, making it possible to train with few consumer\ngraphics cards. However, based on our observation, existing frameworks often\nprovide coarse-grained memory management and require experienced experts in\nconfiguration tuning, leading to suboptimal hardware utilization and\nperformance. This paper proposes ProTrain, a novel training system that\nintelligently balances memory usage and performance by coordinating memory,\ncomputation, and IO. ProTrain achieves adaptive memory management through\nChunk-Based Model State Management and Block-Wise Activation Management, guided\nby a Memory-Aware Runtime Profiler without user intervention. ProTrain does not\nchange the training algorithm and thus does not compromise accuracy.\nExperiments show that ProTrain improves training throughput by 1.43$\\times$ to\n2.71$\\times$ compared to the SOTA training systems.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"1 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"ProTrain: Efficient LLM Training via Memory-Aware Techniques\",\"authors\":\"Hanmei Yang, Jin Zhou, Yao Fu, Xiaoqun Wang, Ramine Roane, Hui Guan, Tongping Liu\",\"doi\":\"arxiv-2406.08334\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"It is extremely memory-hungry to train Large Language Models (LLM). To solve\\nthis problem, existing work exploits the combination of CPU and GPU for the\\ntraining process, such as ZeRO-Offload. Such a technique largely democratizes\\nbillion-scale model training, making it possible to train with few consumer\\ngraphics cards. However, based on our observation, existing frameworks often\\nprovide coarse-grained memory management and require experienced experts in\\nconfiguration tuning, leading to suboptimal hardware utilization and\\nperformance. This paper proposes ProTrain, a novel training system that\\nintelligently balances memory usage and performance by coordinating memory,\\ncomputation, and IO. ProTrain achieves adaptive memory management through\\nChunk-Based Model State Management and Block-Wise Activation Management, guided\\nby a Memory-Aware Runtime Profiler without user intervention. 
ProTrain does not\\nchange the training algorithm and thus does not compromise accuracy.\\nExperiments show that ProTrain improves training throughput by 1.43$\\\\times$ to\\n2.71$\\\\times$ compared to the SOTA training systems.\",\"PeriodicalId\":501291,\"journal\":{\"name\":\"arXiv - CS - Performance\",\"volume\":\"1 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-06-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Performance\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2406.08334\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Performance","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2406.08334","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
ProTrain: Efficient LLM Training via Memory-Aware Techniques

Abstract

Training Large Language Models (LLMs) is extremely memory-intensive. To address this problem, existing work such as ZeRO-Offload exploits the combination of CPU and GPU for the training process. Such techniques largely democratize billion-scale model training, making it possible to train with only a few consumer graphics cards. However, based on our observations, existing frameworks often provide only coarse-grained memory management and require experienced experts for configuration tuning, leading to suboptimal hardware utilization and performance. This paper proposes ProTrain, a novel training system that intelligently balances memory usage and performance by coordinating memory, computation, and IO. ProTrain achieves adaptive memory management through Chunk-Based Model State Management and Block-Wise Activation Management, guided by a Memory-Aware Runtime Profiler, without user intervention. ProTrain does not change the training algorithm and thus does not compromise accuracy. Experiments show that ProTrain improves training throughput by 1.43× to 2.71× compared to state-of-the-art (SOTA) training systems.
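
Since the abstract only names the techniques, the following is a minimal, hypothetical sketch (PyTorch-style Python) of the general idea behind chunk-based model state management with CPU offloading: parameters are flattened into fixed-size chunks kept in pinned CPU memory and copied to the GPU asynchronously just before they are needed. This is not the ProTrain implementation; the class ChunkedParamStore, the chunk size CHUNK_NUMEL, and the prefetch/wait API are illustrative assumptions.

# Illustrative sketch only -- not the ProTrain implementation.
import torch

CHUNK_NUMEL = 1 << 20  # hypothetical chunk size: 1M elements per chunk

class ChunkedParamStore:
    """Groups flattened parameters into fixed-size chunks kept in pinned CPU memory."""

    def __init__(self, params, chunk_numel=CHUNK_NUMEL):
        # Flatten all parameters into one CPU tensor, then split it into chunks.
        flat = torch.cat([p.detach().reshape(-1).cpu() for p in params])
        self.chunks = [
            flat[i:i + chunk_numel].clone().pin_memory()  # pinned memory enables async H2D copies
            for i in range(0, flat.numel(), chunk_numel)
        ]
        self.copy_stream = torch.cuda.Stream()  # side stream so copies overlap compute

    def prefetch(self, idx):
        # Start an asynchronous host-to-device copy of chunk `idx`.
        with torch.cuda.stream(self.copy_stream):
            return self.chunks[idx].to("cuda", non_blocking=True)

    def wait(self):
        # Make the default stream wait until prefetched chunks are ready to use.
        torch.cuda.current_stream().wait_stream(self.copy_stream)


if __name__ == "__main__" and torch.cuda.is_available():
    model = torch.nn.Linear(4096, 4096)       # stand-in for one model block
    store = ChunkedParamStore(model.parameters())
    gpu_chunk = store.prefetch(0)              # overlap the copy with other work
    store.wait()                               # synchronize before consuming the chunk
    print(gpu_chunk.shape, gpu_chunk.device)

A Block-Wise Activation Management counterpart would presumably operate at transformer-block granularity (for example, checkpointing or offloading each block's activations), and the Memory-Aware Runtime Profiler described in the abstract would supply the chunk and block sizes rather than the hard-coded constant used here.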