Hanmei Yang, Jin Zhou, Yao Fu, Xiaoqun Wang, Ramine Roane, Hui Guan, Tongping Liu
{"title":"ProTrain:通过记忆感知技术进行高效 LLM 训练","authors":"Hanmei Yang, Jin Zhou, Yao Fu, Xiaoqun Wang, Ramine Roane, Hui Guan, Tongping Liu","doi":"arxiv-2406.08334","DOIUrl":null,"url":null,"abstract":"It is extremely memory-hungry to train Large Language Models (LLM). To solve\nthis problem, existing work exploits the combination of CPU and GPU for the\ntraining process, such as ZeRO-Offload. Such a technique largely democratizes\nbillion-scale model training, making it possible to train with few consumer\ngraphics cards. However, based on our observation, existing frameworks often\nprovide coarse-grained memory management and require experienced experts in\nconfiguration tuning, leading to suboptimal hardware utilization and\nperformance. This paper proposes ProTrain, a novel training system that\nintelligently balances memory usage and performance by coordinating memory,\ncomputation, and IO. ProTrain achieves adaptive memory management through\nChunk-Based Model State Management and Block-Wise Activation Management, guided\nby a Memory-Aware Runtime Profiler without user intervention. ProTrain does not\nchange the training algorithm and thus does not compromise accuracy.\nExperiments show that ProTrain improves training throughput by 1.43$\\times$ to\n2.71$\\times$ compared to the SOTA training systems.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"1 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"ProTrain: Efficient LLM Training via Memory-Aware Techniques\",\"authors\":\"Hanmei Yang, Jin Zhou, Yao Fu, Xiaoqun Wang, Ramine Roane, Hui Guan, Tongping Liu\",\"doi\":\"arxiv-2406.08334\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"It is extremely memory-hungry to train Large Language Models (LLM). To solve\\nthis problem, existing work exploits the combination of CPU and GPU for the\\ntraining process, such as ZeRO-Offload. Such a technique largely democratizes\\nbillion-scale model training, making it possible to train with few consumer\\ngraphics cards. However, based on our observation, existing frameworks often\\nprovide coarse-grained memory management and require experienced experts in\\nconfiguration tuning, leading to suboptimal hardware utilization and\\nperformance. This paper proposes ProTrain, a novel training system that\\nintelligently balances memory usage and performance by coordinating memory,\\ncomputation, and IO. ProTrain achieves adaptive memory management through\\nChunk-Based Model State Management and Block-Wise Activation Management, guided\\nby a Memory-Aware Runtime Profiler without user intervention. 
ProTrain does not\\nchange the training algorithm and thus does not compromise accuracy.\\nExperiments show that ProTrain improves training throughput by 1.43$\\\\times$ to\\n2.71$\\\\times$ compared to the SOTA training systems.\",\"PeriodicalId\":501291,\"journal\":{\"name\":\"arXiv - CS - Performance\",\"volume\":\"1 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-06-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Performance\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2406.08334\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Performance","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2406.08334","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
ProTrain: Efficient LLM Training via Memory-Aware Techniques

Abstract

Training Large Language Models (LLMs) is extremely memory-intensive. To address this problem, existing work such as ZeRO-Offload exploits the combination of CPU and GPU for the training process. Such techniques largely democratize billion-scale model training, making it possible to train with only a few consumer graphics cards. However, based on our observations, existing frameworks often provide only coarse-grained memory management and require experienced experts for configuration tuning, leading to suboptimal hardware utilization and performance. This paper proposes ProTrain, a novel training system that intelligently balances memory usage and performance by coordinating memory, computation, and IO. ProTrain achieves adaptive memory management through Chunk-Based Model State Management and Block-Wise Activation Management, guided by a Memory-Aware Runtime Profiler, without user intervention. ProTrain does not change the training algorithm and thus does not compromise accuracy. Experiments show that ProTrain improves training throughput by 1.43× to 2.71× compared to state-of-the-art (SOTA) training systems.
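
Since the abstract only names the techniques, the following is a minimal, hypothetical sketch (PyTorch-style Python) of the general idea behind chunk-based model state management with CPU offloading: parameters are flattened into fixed-size chunks kept in pinned CPU memory and copied to the GPU asynchronously just before they are needed. This is not the ProTrain implementation; the class ChunkedParamStore, the chunk size CHUNK_NUMEL, and the prefetch/wait API are illustrative assumptions.

# Illustrative sketch only -- not the ProTrain implementation.
import torch

CHUNK_NUMEL = 1 << 20  # hypothetical chunk size: 1M elements per chunk

class ChunkedParamStore:
    """Groups flattened parameters into fixed-size chunks kept in pinned CPU memory."""

    def __init__(self, params, chunk_numel=CHUNK_NUMEL):
        # Flatten all parameters into one CPU tensor, then split it into chunks.
        flat = torch.cat([p.detach().reshape(-1).cpu() for p in params])
        self.chunks = [
            flat[i:i + chunk_numel].clone().pin_memory()  # pinned memory enables async H2D copies
            for i in range(0, flat.numel(), chunk_numel)
        ]
        self.copy_stream = torch.cuda.Stream()  # side stream so copies overlap compute

    def prefetch(self, idx):
        # Start an asynchronous host-to-device copy of chunk `idx`.
        with torch.cuda.stream(self.copy_stream):
            return self.chunks[idx].to("cuda", non_blocking=True)

    def wait(self):
        # Make the default stream wait until prefetched chunks are ready to use.
        torch.cuda.current_stream().wait_stream(self.copy_stream)


if __name__ == "__main__" and torch.cuda.is_available():
    model = torch.nn.Linear(4096, 4096)       # stand-in for one model block
    store = ChunkedParamStore(model.parameters())
    gpu_chunk = store.prefetch(0)              # overlap the copy with other work
    store.wait()                               # synchronize before consuming the chunk
    print(gpu_chunk.shape, gpu_chunk.device)

A Block-Wise Activation Management counterpart would presumably operate at transformer-block granularity (for example, checkpointing or offloading each block's activations), and the Memory-Aware Runtime Profiler described in the abstract would supply the chunk and block sizes rather than the hard-coded constant used here.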