设计有效的稀疏专家模型

2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) Pub Date : 2022-05-01 DOI:10.1109/IPDPSW55747.2022.00171

Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, J. Dean, Noam M. Shazeer, W. Fedus

{"title":"设计有效的稀疏专家模型","authors":"Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, J. Dean, Noam M. Shazeer, W. Fedus","doi":"10.1109/IPDPSW55747.2022.00171","DOIUrl":null,"url":null,"abstract":"Scale has opened new frontiers in natural language processing -- but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy efficient path to even larger and more capable language models. But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning. Our work focuses on these issues and acts as a design guide. We conclude by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts or ST-MoE-32B). For the first time, a sparse model achieves state-of-the-art performance in transfer learning, across a diverse set of tasks including reasoning (SuperGLUE, ARC Easy, ARC Challenge), summarization (XSum, CNN-DM), closed book question answering (WebQA, Natural Questions), and adversarially constructed tasks (Winogrande, ANLI R3).","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"228 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"58","resultStr":"{\"title\":\"Designing Effective Sparse Expert Models\",\"authors\":\"Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, J. Dean, Noam M. Shazeer, W. Fedus\",\"doi\":\"10.1109/IPDPSW55747.2022.00171\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Scale has opened new frontiers in natural language processing -- but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy efficient path to even larger and more capable language models. But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning. Our work focuses on these issues and acts as a design guide. We conclude by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts or ST-MoE-32B). For the first time, a sparse model achieves state-of-the-art performance in transfer learning, across a diverse set of tasks including reasoning (SuperGLUE, ARC Easy, ARC Challenge), summarization (XSum, CNN-DM), closed book question answering (WebQA, Natural Questions), and adversarially constructed tasks (Winogrande, ANLI R3).\",\"PeriodicalId\":286968,\"journal\":{\"name\":\"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)\",\"volume\":\"228 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-05-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"58\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IPDPSW55747.2022.00171\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPSW55747.2022.00171","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 58

摘要

规模为自然语言处理开辟了新的领域——但代价高昂。作为回应，专家混合(MoE)和开关变压器被提议作为更大、更有能力的语言模型的节能途径。但是，在广泛的自然语言任务中推进最先进的技术一直受到训练不稳定性和微调过程中质量不确定的阻碍。我们的工作集中在这些问题上，并作为设计指南。最后，我们将稀疏模型缩放到269B参数，其计算成本与32B密集编码器-解码器变压器(稳定和可转移的专家混合或ST-MoE-32B)相当。第一次，稀疏模型在迁移学习中实现了最先进的性能，跨越了多种任务，包括推理(SuperGLUE, ARC Easy, ARC Challenge)，总结(XSum, CNN-DM)，闭书问答(WebQA，自然问题)和对抗构建任务(Winogrande, ANLI R3)。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Designing Effective Sparse Expert Models

Scale has opened new frontiers in natural language processing -- but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy efficient path to even larger and more capable language models. But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning. Our work focuses on these issues and acts as a design guide. We conclude by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts or ST-MoE-32B). For the first time, a sparse model achieves state-of-the-art performance in transfer learning, across a diverse set of tasks including reasoning (SuperGLUE, ARC Easy, ARC Challenge), summarization (XSum, CNN-DM), closed book question answering (WebQA, Natural Questions), and adversarially constructed tasks (Winogrande, ANLI R3).

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

自引率

0.00%

发文量