FORGE: Pre-Training Open Foundation Models for Science

Junqi Yin, Sajal Dash, Feiyi Wang, M. Shankar
{"title":"FORGE: Pre-Training Open Foundation Models for Science","authors":"Junqi Yin, Sajal Dash, Feiyi Wang, M. Shankar","doi":"10.1145/3581784.3613215","DOIUrl":null,"url":null,"abstract":"Large language models (LLMs) are poised to revolutionize the way we conduct scientific research. However, both model complexity and pre-training cost are impeding effective adoption for the wider science community. Identifying suitable scientific use cases, finding the optimal balance between model and data sizes, and scaling up model training are among the most pressing issues that need to be addressed. In this study, we provide practical solutions for building and using LLM-based foundation models targeting scientific research use cases. We present an end-to-end examination of the effectiveness of LLMs in scientific research, including their scaling behavior and computational requirements on Frontier, the first Exascale supercomputer. We have also developed for release to the scientific community a suite of open foundation models called FORGE with up to 26B parameters using 257B tokens from over 200M scientific articles, with performance either on par or superior to other state-of-the-art comparable models. We have demonstrated the use and effectiveness of FORGE on scientific downstream tasks. Our research establishes best practices that can be applied across various fields to take advantage of LLMs for scientific discovery.","PeriodicalId":124077,"journal":{"name":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"65 3","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2023-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3581784.3613215","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Large language models (LLMs) are poised to revolutionize the way we conduct scientific research. However, both model complexity and pre-training cost are impeding effective adoption for the wider science community. Identifying suitable scientific use cases, finding the optimal balance between model and data sizes, and scaling up model training are among the most pressing issues that need to be addressed. In this study, we provide practical solutions for building and using LLM-based foundation models targeting scientific research use cases. We present an end-to-end examination of the effectiveness of LLMs in scientific research, including their scaling behavior and computational requirements on Frontier, the first exascale supercomputer. We have also developed for release to the scientific community a suite of open foundation models called FORGE with up to 26B parameters, trained on 257B tokens from over 200M scientific articles, with performance on par with or superior to other comparable state-of-the-art models. We have demonstrated the use and effectiveness of FORGE on scientific downstream tasks. Our research establishes best practices that can be applied across various fields to take advantage of LLMs for scientific discovery.
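
To put the headline numbers in perspective, here is a minimal back-of-the-envelope sketch of the pre-training budget they imply, using the standard C ≈ 6ND FLOPs rule of thumb (Kaplan et al., 2020). The per-GPU sustained throughput below is an illustrative assumption, not a figure reported in the paper:

```python
# Rough pre-training cost for the largest FORGE model, via the
# standard C ~= 6 * N * D FLOPs approximation (Kaplan et al., 2020).
N = 26e9    # parameters (FORGE 26B)
D = 257e9   # pre-training tokens

flops = 6 * N * D
print(f"estimated compute: {flops:.2e} FLOPs")   # ~4.0e22 FLOPs

# Convert to GPU-hours under an ASSUMED sustained per-GPU throughput;
# this value is illustrative only, not a measured Frontier number.
sustained = 50e12                                 # 50 TFLOP/s per GPU (assumption)
gpu_hours = flops / sustained / 3600
print(f"~{gpu_hours:,.0f} GPU-hours at the assumed throughput")
```

For context, 257B tokens over 26B parameters is roughly 10 tokens per parameter, below the ~20 tokens per parameter suggested by Chinchilla-style compute-optimal scaling (Hoffmann et al., 2022), which is the kind of model-size/data-size trade-off the abstract refers to.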