{"title":"基于课程学习和稀疏注意的GPT大模型LanYUAN","authors":"Gonghai Zhou, Yuhong Zhang, Rizhen Hu, Yang Zhang","doi":"10.1145/3603781.3603827","DOIUrl":null,"url":null,"abstract":"In 2021, the Inspur AI Research Institute introduced the AI Megatron Model Yuan-1.0, a massive Chinese language AI model containing 245.7 billion parameters. This model surpassed OpenAI's GPT-3, making it the world's largest Chinese NLP model. Although the model was pre-trained using Nvidia's Megatron framework with model parallelism, data parallelism, and pipelining optimizations, there is still room for improvement in terms of training time, cost, and convergence. To achieve better performance, this paper investigates the impacts of batch size and learning rate on model training time and accuracy to balance model performance. We replaced the pipelining optimization with the more efficient DeepSpeed framework, and combined DeepSpeed's ZeRO-based data parallelism with Nvidia's Megatron-LM model parallelism to achieve higher performance on Nvidia GPU clusters with high-bandwidth interconnects. Additionally, we used a curriculum learning-based method and four types of sparse attention as a new optimization approaches. The results showed that the training time was reduced by 20% and the throughput increased by 20% compared to the 47 billion parameters Yuan-1.0 model. Approximately, the optimized model achieved performance improvement in downstream tasks with the same training data.","PeriodicalId":391180,"journal":{"name":"Proceedings of the 2023 4th International Conference on Computing, Networks and Internet of Things","volume":"265 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"LanYUAN, a GPT large model using Curriculum Learning and Sparse Attention\",\"authors\":\"Gonghai Zhou, Yuhong Zhang, Rizhen Hu, Yang Zhang\",\"doi\":\"10.1145/3603781.3603827\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In 2021, the Inspur AI Research Institute introduced the AI Megatron Model Yuan-1.0, a massive Chinese language AI model containing 245.7 billion parameters. This model surpassed OpenAI's GPT-3, making it the world's largest Chinese NLP model. Although the model was pre-trained using Nvidia's Megatron framework with model parallelism, data parallelism, and pipelining optimizations, there is still room for improvement in terms of training time, cost, and convergence. To achieve better performance, this paper investigates the impacts of batch size and learning rate on model training time and accuracy to balance model performance. We replaced the pipelining optimization with the more efficient DeepSpeed framework, and combined DeepSpeed's ZeRO-based data parallelism with Nvidia's Megatron-LM model parallelism to achieve higher performance on Nvidia GPU clusters with high-bandwidth interconnects. Additionally, we used a curriculum learning-based method and four types of sparse attention as a new optimization approaches. The results showed that the training time was reduced by 20% and the throughput increased by 20% compared to the 47 billion parameters Yuan-1.0 model. 
Approximately, the optimized model achieved performance improvement in downstream tasks with the same training data.\",\"PeriodicalId\":391180,\"journal\":{\"name\":\"Proceedings of the 2023 4th International Conference on Computing, Networks and Internet of Things\",\"volume\":\"265 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-05-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2023 4th International Conference on Computing, Networks and Internet of Things\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3603781.3603827\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2023 4th International Conference on Computing, Networks and Internet of Things","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3603781.3603827","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
LanYUAN, a GPT large model using Curriculum Learning and Sparse Attention
In 2021, the Inspur AI Research Institute introduced Yuan-1.0, a massive Chinese-language AI model containing 245.7 billion parameters. The model surpassed OpenAI's GPT-3 in scale, making it the world's largest Chinese NLP model. Although it was pre-trained with Nvidia's Megatron framework using model parallelism, data parallelism, and pipelining optimizations, there is still room for improvement in training time, cost, and convergence. To achieve better performance, this paper investigates the impact of batch size and learning rate on training time and accuracy in order to balance training efficiency and model quality. We replaced the pipelining optimization with the more efficient DeepSpeed framework and combined DeepSpeed's ZeRO-based data parallelism with Nvidia's Megatron-LM model parallelism to achieve higher performance on Nvidia GPU clusters with high-bandwidth interconnects. Additionally, we used a curriculum learning-based method and four types of sparse attention as new optimization approaches. The results show that, compared with the 47-billion-parameter Yuan-1.0 model, training time was reduced by 20% and throughput increased by 20%. With the same training data, the optimized model also achieved improved performance on downstream tasks.
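As a rough illustration of how the optimizations named in the abstract are typically wired together, the sketch below shows a DeepSpeed-style training configuration that combines ZeRO-based data parallelism with Megatron-LM's model-parallel process groups, and enables DeepSpeed's built-in sequence-length curriculum learning and sparse attention. This is a minimal sketch under assumptions: all numeric values (batch size, learning rate, ZeRO stage, curriculum schedule, sparse-attention layout) are illustrative placeholders, not the paper's actual hyperparameters, and the Megatron-LM imports assume a standard Megatron-DeepSpeed setup rather than the authors' exact codebase.

```python
# Hypothetical sketch: DeepSpeed ZeRO data parallelism + Megatron-LM model
# parallelism, with curriculum learning and sparse attention enabled in the
# DeepSpeed config. Values are illustrative, not the paper's settings.
import deepspeed
from megatron import mpu                      # Megatron-LM model-parallel utilities (assumed layout)
from megatron.model import GPTModel           # assumed GPT model entry point
from megatron.initialize import initialize_megatron

ds_config = {
    "train_batch_size": 1024,                 # global batch size (assumed)
    "optimizer": {
        "type": "Adam",
        "params": {"lr": 1.6e-4, "betas": [0.9, 0.95]},
    },
    "fp16": {"enabled": True},
    # ZeRO-based data parallelism; stage 1 partitions optimizer states and
    # composes with tensor model parallelism across the remaining GPUs.
    "zero_optimization": {"stage": 1},
    # DeepSpeed curriculum learning: grow the training sequence length from
    # short to full over a fixed number of steps.
    "curriculum_learning": {
        "enabled": True,
        "curriculum_type": "seqlen",
        "min_difficulty": 64,
        "max_difficulty": 2048,
        "schedule_type": "fixed_linear",
        "schedule_config": {"total_curriculum_step": 15000, "difficulty_step": 8},
    },
    # DeepSpeed sparse attention; besides "fixed", the other supported sparse
    # layouts are "variable", "bigbird", and "bslongformer", matching the
    # "four types of sparse attention" mentioned in the abstract.
    "sparse_attention": {
        "mode": "fixed",
        "block": 16,
        "num_local_blocks": 4,
        "num_global_blocks": 1,
        "attention": "unidirectional",
    },
}

# Requires a launched distributed job; sets up Megatron's global args and
# tensor/pipeline parallel process groups.
initialize_megatron()
model = GPTModel(num_tokentypes=0, parallel_output=True)

# Passing Megatron's mpu tells DeepSpeed which ranks form the data-parallel
# groups, so ZeRO partitioning coexists with Megatron's model parallelism.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    mpu=mpu,
    config=ds_config,
)
```

In this arrangement, DeepSpeed replaces the pipelining layer described in the abstract, while Megatron-LM continues to handle intra-layer (tensor) model parallelism; the curriculum and sparse-attention sections are the configuration hooks through which those two optimizations would normally be switched on.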