ProTrain: Efficient LLM Training via Memory-Aware Techniques
Hanmei Yang, Jin Zhou, Yao Fu, Xiaoqun Wang, Ramine Roane, Hui Guan, Tongping Liu
arXiv - CS - Performance, published 2024-06-12 (arXiv:2406.08334). Citations: 0
Abstract
Training Large Language Models (LLMs) is extremely memory-hungry. To address this problem, existing work such as ZeRO-Offload combines CPU and GPU memory during training. Such techniques have largely democratized billion-scale model training, making it possible to train with only a few consumer graphics cards. However, we observe that existing frameworks often provide coarse-grained memory management and require expert configuration tuning, leading to suboptimal hardware utilization and performance. This paper proposes ProTrain, a novel training system that intelligently balances memory usage and performance by coordinating memory, computation, and IO. ProTrain achieves adaptive memory management through Chunk-Based Model State Management and Block-Wise Activation Management, guided by a Memory-Aware Runtime Profiler and requiring no user intervention. ProTrain does not change the training algorithm and thus does not compromise accuracy. Experiments show that ProTrain improves training throughput by 1.43$\times$ to 2.71$\times$ over state-of-the-art training systems.
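To make the idea of chunk-based model state management concrete, here is a minimal, hedged sketch in plain Python. It is not ProTrain's actual implementation; the chunk size, class names, and eviction policy are illustrative assumptions. The key point it shows is that parameters are packed into fixed-size chunks, and whole chunks (rather than individual tensors) are moved between a simulated CPU pool and a budget-limited "GPU" pool, keeping transfers coarse-grained and bookkeeping small.

```python
# Illustrative sketch of chunk-based parameter offloading (an assumption of
# the general technique, NOT ProTrain's real code). Parameter tensors are
# flattened into fixed-size chunks; a manager keeps a bounded number of
# chunks "on GPU" and offloads the rest back to a CPU pool.

CHUNK_SIZE = 4  # elements per chunk; tiny on purpose for illustration


def pack_into_chunks(params, chunk_size=CHUNK_SIZE):
    """Flatten a list of parameter tensors into fixed-size chunks."""
    flat = [x for p in params for x in p]
    return [flat[i:i + chunk_size] for i in range(0, len(flat), chunk_size)]


class ChunkManager:
    """Keeps at most `gpu_budget` chunks resident in the 'GPU' pool."""

    def __init__(self, chunks, gpu_budget):
        self.cpu = {i: c for i, c in enumerate(chunks)}  # all start on CPU
        self.gpu = {}
        self.budget = gpu_budget

    def fetch(self, idx):
        """Bring chunk `idx` to the GPU pool, evicting the oldest if full."""
        if idx in self.gpu:
            return self.gpu[idx]
        if len(self.gpu) >= self.budget:
            # Evict the oldest resident chunk: offload it back to CPU memory.
            old, chunk = next(iter(self.gpu.items()))
            del self.gpu[old]
            self.cpu[old] = chunk
        self.gpu[idx] = self.cpu.pop(idx)
        return self.gpu[idx]


# Three "parameter tensors" of uneven sizes, packed into two chunks of 4.
params = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0]]
chunks = pack_into_chunks(params)
mgr = ChunkManager(chunks, gpu_budget=1)
mgr.fetch(0)  # chunk 0 brought onto the GPU
mgr.fetch(1)  # GPU budget exceeded: chunk 0 offloaded, chunk 1 resident
```

Because movement happens at chunk granularity, the runtime can overlap a chunk's CPU-GPU transfer with computation on the previous chunk, which is the kind of memory/compute/IO coordination the abstract describes.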