ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He
{"title":"ZeRO:面向训练万亿参数模型的内存优化","authors":"Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He","doi":"10.1109/SC41405.2020.00024","DOIUrl":null,"url":null,"abstract":"Large deep learning models offer significant accuracy gains, but training billions to trillions of parameters is challenging. Existing solutions such as data and model parallelisms exhibit fundamental limitations to fit these models into limited device memory, while obtaining computation, communication and development efficiency. We develop a novel solution, Zero Redundancy Optimizer (ZeRO), to optimize memory, vastly improving training speed while increasing the model size that can be efficiently trained. ZeRO eliminates memory redundancies in data- and model-parallel training while retaining low communication volume and high computational granularity, allowing us to scale the model size proportional to the number of devices with sustained high efficiency. Our analysis on memory requirements and communication volume demonstrates: ZeRO has the potential to scale beyond 1 Trillion parameters using today’s hardware. We implement and evaluate ZeRO: it trains large models of over 100B parameter with super-linear speedup on 400 GPUs, achieving throughput of 15 Petaflops. This represents an 8x increase in model size and 10x increase in achievable performance over state-of-the-art. In terms of usability, ZeRO can train large models of up to 13B parameters (e.g., larger than Megatron GPT 8. 3B and T5 11B) without requiring model parallelism which is harder for scientists to apply. Last but not the least, researchers have used the system breakthroughs of ZeRO to create Turing-NLG, the world’s largest language model at the time (17B parameters) with record breaking accuracy.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"125 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"563","resultStr":"{\"title\":\"ZeRO: Memory optimizations Toward Training Trillion Parameter Models\",\"authors\":\"Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He\",\"doi\":\"10.1109/SC41405.2020.00024\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Large deep learning models offer significant accuracy gains, but training billions to trillions of parameters is challenging. Existing solutions such as data and model parallelisms exhibit fundamental limitations to fit these models into limited device memory, while obtaining computation, communication and development efficiency. We develop a novel solution, Zero Redundancy Optimizer (ZeRO), to optimize memory, vastly improving training speed while increasing the model size that can be efficiently trained. ZeRO eliminates memory redundancies in data- and model-parallel training while retaining low communication volume and high computational granularity, allowing us to scale the model size proportional to the number of devices with sustained high efficiency. Our analysis on memory requirements and communication volume demonstrates: ZeRO has the potential to scale beyond 1 Trillion parameters using today’s hardware. We implement and evaluate ZeRO: it trains large models of over 100B parameter with super-linear speedup on 400 GPUs, achieving throughput of 15 Petaflops. 
This represents an 8x increase in model size and 10x increase in achievable performance over state-of-the-art. In terms of usability, ZeRO can train large models of up to 13B parameters (e.g., larger than Megatron GPT 8. 3B and T5 11B) without requiring model parallelism which is harder for scientists to apply. Last but not the least, researchers have used the system breakthroughs of ZeRO to create Turing-NLG, the world’s largest language model at the time (17B parameters) with record breaking accuracy.\",\"PeriodicalId\":424429,\"journal\":{\"name\":\"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis\",\"volume\":\"125 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-10-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"563\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SC41405.2020.00024\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SC41405.2020.00024","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 563

Abstract

Large deep learning models offer significant accuracy gains, but training billions to trillions of parameters is challenging. Existing solutions such as data and model parallelism exhibit fundamental limitations when fitting these models into limited device memory while maintaining computation, communication, and development efficiency. We develop a novel solution, Zero Redundancy Optimizer (ZeRO), to optimize memory, vastly improving training speed while increasing the model size that can be efficiently trained. ZeRO eliminates memory redundancies in data- and model-parallel training while retaining low communication volume and high computational granularity, allowing us to scale the model size in proportion to the number of devices with sustained high efficiency. Our analysis of memory requirements and communication volume demonstrates that ZeRO has the potential to scale beyond 1 trillion parameters using today's hardware. We implement and evaluate ZeRO: it trains large models of over 100B parameters with super-linear speedup on 400 GPUs, achieving a throughput of 15 petaflops. This represents an 8x increase in model size and a 10x increase in achievable performance over the state of the art. In terms of usability, ZeRO can train large models of up to 13B parameters (e.g., larger than Megatron GPT 8.3B and T5 11B) without requiring model parallelism, which is harder for scientists to apply. Last but not least, researchers have used the system breakthroughs of ZeRO to create Turing-NLG, the world's largest language model at the time (17B parameters), with record-breaking accuracy.
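To make the abstract's memory argument concrete, below is a minimal back-of-the-envelope sketch (not from the paper's codebase) of the per-GPU model-state footprint under the three ZeRO partitioning stages. It assumes the mixed-precision Adam setup analyzed in the paper: 2 bytes of fp16 parameters, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer states per parameter, i.e. 16 bytes per parameter in total. The function name and the "stage 0-3" labels are illustrative assumptions, not the paper's notation.

```python
# Back-of-the-envelope model-state memory per GPU under ZeRO, following the
# paper's mixed-precision Adam accounting: 2 bytes (fp16 params) + 2 bytes
# (fp16 grads) + 12 bytes (fp32 params + Adam momentum/variance) = 16 B/param.
# Helper name and stage numbering are illustrative assumptions.

def zero_memory_gb(num_params: float, num_gpus: int, stage: int) -> float:
    """Estimate per-GPU model-state memory (GB) for a given ZeRO stage."""
    params_bytes = 2 * num_params          # fp16 parameters
    grads_bytes = 2 * num_params           # fp16 gradients
    optim_bytes = 12 * num_params          # fp32 copy + Adam momentum/variance

    if stage == 0:    # plain data parallelism: everything replicated
        per_gpu = params_bytes + grads_bytes + optim_bytes
    elif stage == 1:  # partition optimizer states only
        per_gpu = params_bytes + grads_bytes + optim_bytes / num_gpus
    elif stage == 2:  # also partition gradients
        per_gpu = params_bytes + (grads_bytes + optim_bytes) / num_gpus
    elif stage == 3:  # also partition parameters
        per_gpu = (params_bytes + grads_bytes + optim_bytes) / num_gpus
    else:
        raise ValueError("stage must be 0, 1, 2, or 3")
    return per_gpu / 1e9

if __name__ == "__main__":
    # e.g. a 100B-parameter model on 400 GPUs, as in the paper's evaluation
    for stage in range(4):
        gb = zero_memory_gb(100e9, 400, stage)
        print(f"ZeRO stage {stage}: ~{gb:,.0f} GB of model states per GPU")
```

Under these assumptions, a 100B-parameter model on 400 GPUs needs roughly 1,600 GB of model states per GPU with plain data parallelism but only a few GB with full partitioning, which illustrates why partitioning model states lets the trainable model size grow in proportion to the number of devices.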