{"title":"An open-source family of large encoder-decoder foundation models for chemistry.","authors":"Eduardo Soares, Emilio Vital Brazil, Victor Shirasuna, Dmitry Zubarev, Renato Cerqueira, Kristin Schmidt","doi":"10.1038/s42004-025-01585-0","DOIUrl":null,"url":null,"abstract":"<p><p>The use of foundation models has extended from natural language processing to molecular modeling. In this context, large-scale pre-training strategies have been applied to chemical language models to enable representation learning across diverse tasks. Here we introduce a family of encoder-decoder chemical foundation models pre-trained on a curated dataset of 91 million molecular sequences from PubChem. These models support a range of applications, including property estimation and reaction outcome prediction. We evaluate two model variants across several benchmark datasets and show that they match or exceed existing approaches. We also assess the structure of the learned representations and find that the embedding space supports few-shot learning and separates molecules based on chemically relevant features. This structure appears to result from the decoder-based reconstruction objective used during pre-training. These findings suggest that the proposed models can serve as general-purpose tools for molecular analysis and reasoning with minimal supervision.</p>","PeriodicalId":10529,"journal":{"name":"Communications Chemistry","volume":"8 1","pages":"193"},"PeriodicalIF":6.2000,"publicationDate":"2025-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12216393/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Communications Chemistry","FirstCategoryId":"92","ListUrlMain":"https://doi.org/10.1038/s42004-025-01585-0","RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"CHEMISTRY, MULTIDISCIPLINARY","Score":null,"Total":0}
Abstract
The use of foundation models has extended from natural language processing to molecular modeling. In this context, large-scale pre-training strategies have been applied to chemical language models to enable representation learning across diverse tasks. Here we introduce a family of encoder-decoder chemical foundation models pre-trained on a curated dataset of 91 million molecular sequences from PubChem. These models support a range of applications, including property estimation and reaction outcome prediction. We evaluate two model variants across several benchmark datasets and show that they match or exceed existing approaches. We also assess the structure of the learned representations and find that the embedding space supports few-shot learning and separates molecules based on chemically relevant features. This structure appears to result from the decoder-based reconstruction objective used during pre-training. These findings suggest that the proposed models can serve as general-purpose tools for molecular analysis and reasoning with minimal supervision.
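To make the abstract's claim about the embedding space concrete, below is a minimal sketch of how frozen encoder embeddings from an encoder-decoder chemical language model could support few-shot property classification. The checkpoint identifier, pooling strategy, and linear probe are illustrative assumptions, not the authors' released code or pipeline.

```python
# Minimal sketch: frozen encoder embeddings + a few-shot linear probe.
# "org/chem-encoder-decoder" is a hypothetical placeholder, not the
# identifier of the models described in the paper.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from sklearn.linear_model import LogisticRegression

MODEL_ID = "org/chem-encoder-decoder"  # placeholder checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)
model.eval()

def embed(smiles_list):
    """Mean-pool the encoder's last hidden states into one vector per molecule."""
    batch = tokenizer(smiles_list, padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model.get_encoder()(**batch).last_hidden_state  # (B, T, D)
    mask = batch["attention_mask"].unsqueeze(-1)                 # (B, T, 1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()        # (B, D)

# Few-shot setup: a handful of labeled SMILES strings (toy labels here)
# is enough to fit a linear probe on the frozen molecular embeddings.
train_smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]
train_labels = [0, 1, 0, 0]

probe = LogisticRegression(max_iter=1000).fit(embed(train_smiles), train_labels)
print(probe.predict(embed(["c1ccccc1C", "CCCO"])))
```

The design choice illustrated here, keeping the pre-trained encoder frozen and fitting only a lightweight classifier on pooled embeddings, mirrors the abstract's point that the learned representation separates molecules by chemically relevant features with minimal supervision.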
About the journal
Communications Chemistry is an open access journal from Nature Research publishing high-quality research, reviews and commentary in all areas of the chemical sciences. Research papers published by the journal represent significant advances bringing new chemical insight to a specialized area of research. We also aim to provide a community forum for issues of importance to all chemists, regardless of sub-discipline.