Thorough Characterization and Analysis of Large Transformer Model Training At-Scale

Proc. ACM Meas. Anal. Comput. Syst. Pub Date: 2024-02-16. DOI: 10.1145/3639034

Scott Cheng, Jun-Liang Lin, M. Emani, Siddhisanket Raskar, Sam Foreman, Zhen Xie, Venkat Vishwanath, M. Kandemir


Citations: 0

Abstract



Large transformer models have recently achieved great success across various domains. As parameter counts grow, training a large transformer model today typically involves model sharding, data parallelism, and model parallelism. The throughput of large-scale training therefore depends heavily on network bandwidth, since combining model sharding with multiple parallelism strategies incurs various costs. However, prior characterizations of transformer models, performed on high-bandwidth DGX machines with TFLOPS as the metric, may not reflect the performance of a system with lower bandwidth. Furthermore, data and model parallelism exhibit significantly distinct training profiles on different system bandwidths at scale and thus need a thorough study. In this paper, we provide a bottom-up breakdown of training throughput into compute and communication time, and quantitatively analyze their respective influences on overall end-to-end training scaling. Our evaluation involves an in-depth exploration of data parallelism, scaling up to 512 GPUs with limited bandwidth, and examines three model sharding strategies across six model sizes. We also evaluate three combinations of model parallelism on both high- and low-bandwidth supercomputing systems. Overall, our work provides a broader perspective on large-scale transformer model training, and our analysis and evaluation yield practical insights for predicting training scaling and for shaping the future development of supercomputing system design.
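To make the idea of splitting a training step into compute and communication time concrete, the following is a minimal sketch and not the authors' methodology. It assumes pure data parallelism, a bandwidth-only ring all-reduce cost (latency ignored), and no overlap between compute and communication. The function names `ring_allreduce_seconds` and `step_breakdown`, and all numbers in the usage loop, are hypothetical and chosen only for illustration.

```python
def ring_allreduce_seconds(grad_bytes, num_gpus, bandwidth_bytes_per_s):
    """Bandwidth-only estimate of one ring all-reduce (assumption: latency ignored).

    Each GPU sends and receives 2 * (N - 1) / N times the gradient size.
    """
    if num_gpus <= 1:
        return 0.0
    volume = 2.0 * (num_gpus - 1) / num_gpus * grad_bytes
    return volume / bandwidth_bytes_per_s


def step_breakdown(compute_s, grad_bytes, num_gpus, bandwidth_bytes_per_s):
    """Split one training step into compute and communication time.

    Assumption: compute and communication do not overlap, so they simply add.
    """
    comm_s = ring_allreduce_seconds(grad_bytes, num_gpus, bandwidth_bytes_per_s)
    total_s = compute_s + comm_s
    comm_fraction = comm_s / total_s if total_s > 0 else 0.0
    return {
        "compute_s": compute_s,
        "comm_s": comm_s,
        "total_s": total_s,
        "comm_fraction": comm_fraction,
    }


if __name__ == "__main__":
    # Hypothetical inputs: 1e9 parameters with 2-byte gradients, 1 s of compute per step.
    grad_bytes = 1e9 * 2
    for label, bandwidth in [("low bandwidth", 12.5e9), ("high bandwidth", 200e9)]:
        result = step_breakdown(1.0, grad_bytes, 64, bandwidth)
        print(label, result)
```

Under these assumptions the same model and GPU count spend a larger share of each step in communication on the lower-bandwidth system, which is the kind of effect a TFLOPS-only metric measured on a high-bandwidth machine would not reveal.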

