Liyuan Liu, Young Jin Kim, Shuohang Wang, Chen Liang, Yelong Shen, Hao Cheng, Xiaodong Liu, Masahiro Tanaka, Xiaoxia Wu, Wenxiang Hu, Vishrav Chaudhary, Zeqi Lin, Chenruidong Zhang, Jilong Xue, Hany Awadalla, Jianfeng Gao, Weizhu Chen
{"title":"GRIN: GRadient-INformed MoE","authors":"Liyuan Liu, Young Jin Kim, Shuohang Wang, Chen Liang, Yelong Shen, Hao Cheng, Xiaodong Liu, Masahiro Tanaka, Xiaoxia Wu, Wenxiang Hu, Vishrav Chaudhary, Zeqi Lin, Chenruidong Zhang, Jilong Xue, Hany Awadalla, Jianfeng Gao, Weizhu Chen","doi":"arxiv-2409.12136","DOIUrl":null,"url":null,"abstract":"Mixture-of-Experts (MoE) models scale more effectively than dense models due\nto sparse computation through expert routing, selectively activating only a\nsmall subset of expert modules. However, sparse computation challenges\ntraditional training practices, as discrete expert routing hinders standard\nbackpropagation and thus gradient-based optimization, which are the cornerstone\nof deep learning. To better pursue the scaling power of MoE, we introduce GRIN\n(GRadient-INformed MoE training), which incorporates sparse gradient estimation\nfor expert routing and configures model parallelism to avoid token dropping.\nApplying GRIN to autoregressive language modeling, we develop a top-2\n16$\\times$3.8B MoE model. Our model, with only 6.6B activated parameters,\noutperforms a 7B dense model and matches the performance of a 14B dense model\ntrained on the same data. Extensive evaluations across diverse tasks\ndemonstrate the potential of GRIN to significantly enhance MoE efficacy,\nachieving 79.4 on MMLU, 83.7 on HellaSwag, 74.4 on HumanEval, and 58.9 on MATH.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"25 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"GRIN: GRadient-INformed MoE\",\"authors\":\"Liyuan Liu, Young Jin Kim, Shuohang Wang, Chen Liang, Yelong Shen, Hao Cheng, Xiaodong Liu, Masahiro Tanaka, Xiaoxia Wu, Wenxiang Hu, Vishrav Chaudhary, Zeqi Lin, Chenruidong Zhang, Jilong Xue, Hany Awadalla, Jianfeng Gao, Weizhu Chen\",\"doi\":\"arxiv-2409.12136\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Mixture-of-Experts (MoE) models scale more effectively than dense models due\\nto sparse computation through expert routing, selectively activating only a\\nsmall subset of expert modules. However, sparse computation challenges\\ntraditional training practices, as discrete expert routing hinders standard\\nbackpropagation and thus gradient-based optimization, which are the cornerstone\\nof deep learning. To better pursue the scaling power of MoE, we introduce GRIN\\n(GRadient-INformed MoE training), which incorporates sparse gradient estimation\\nfor expert routing and configures model parallelism to avoid token dropping.\\nApplying GRIN to autoregressive language modeling, we develop a top-2\\n16$\\\\times$3.8B MoE model. Our model, with only 6.6B activated parameters,\\noutperforms a 7B dense model and matches the performance of a 14B dense model\\ntrained on the same data. Extensive evaluations across diverse tasks\\ndemonstrate the potential of GRIN to significantly enhance MoE efficacy,\\nachieving 79.4 on MMLU, 83.7 on HellaSwag, 74.4 on HumanEval, and 58.9 on MATH.\",\"PeriodicalId\":501030,\"journal\":{\"name\":\"arXiv - CS - Computation and Language\",\"volume\":\"25 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Computation and Language\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.12136\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computation and Language","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.12136","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Mixture-of-Experts (MoE) models scale more effectively than dense models due
to sparse computation through expert routing, selectively activating only a
small subset of expert modules. However, sparse computation challenges
traditional training practices, as discrete expert routing hinders standard
backpropagation and thus gradient-based optimization, which are the cornerstone
of deep learning. To better pursue the scaling power of MoE, we introduce GRIN
(GRadient-INformed MoE training), which incorporates sparse gradient estimation
for expert routing and configures model parallelism to avoid token dropping.
Applying GRIN to autoregressive language modeling, we develop a top-2
16$\times$3.8B MoE model. Our model, with only 6.6B activated parameters,
outperforms a 7B dense model and matches the performance of a 14B dense model
trained on the same data. Extensive evaluations across diverse tasks
demonstrate the potential of GRIN to significantly enhance MoE efficacy,
achieving 79.4 on MMLU, 83.7 on HellaSwag, 74.4 on HumanEval, and 58.9 on MATH.