GP-MoLFormer: a foundation model for molecular generation

IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY
Jerret Ross, Brian Belgodere, Samuel C. Hoffman, Vijil Chenthamarakshan, Jiri Navratil, Youssef Mroueh and Payel Das
{"title":"GP-MoLFormer:分子生成的基础模型","authors":"Jerret Ross, Brian Belgodere, Samuel C. Hoffman, Vijil Chenthamarakshan, Jiri Navratil, Youssef Mroueh and Payel Das","doi":"10.1039/D5DD00122F","DOIUrl":null,"url":null,"abstract":"<p >Transformer-based models trained on large and general purpose datasets consisting of molecular strings have recently emerged as a powerful tool for successfully modeling various structure–property relations. Inspired by this success, we extend the paradigm of training chemical language transformers on large-scale chemical datasets to generative tasks in this work. Specifically, we propose GP-MoLFormer, an autoregressive molecular string generator that is trained on more than 1.1b (billion) chemical SMILES. GP-MoLFormer uses a 46.8m parameter transformer decoder model with linear attention and rotary positional encodings as the base architecture. GP-MoLFormer's utility is evaluated and compared with that of existing baselines on three different tasks: <em>de novo</em> generation, scaffold-constrained molecular decoration, and unconstrained property-guided optimization. While the first two are handled with no additional training, we propose a parameter-efficient fine-tuning method for the last task, which uses property-ordered molecular pairs as input. We call this new approach pair-tuning. Our results show GP-MoLFormer performs better or comparable with baselines across all three tasks, while producing molecules with higher diversity demonstrating its general utility for a variety of molecular generation tasks. We further report strong memorization of training data in GP-MoLFormer generations, which has so far remained unexplored for chemical language models. Our analyses reveal that training data memorization and novelty in generations are impacted by the quality and scale of the training data; duplication bias in training data can enhance memorization at the cost of lowering novelty. We further establish a scaling law relating inference compute and novelty in generations, and show that the proposed model excels at yielding molecules containing unique scaffolds while generating at ≈10<small><sup>6</sup></small> to 10<small><sup>9</sup></small> scale.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 10","pages":" 2684-2696"},"PeriodicalIF":6.2000,"publicationDate":"2025-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00122f?page=search","citationCount":"0","resultStr":"{\"title\":\"GP-MoLFormer: a foundation model for molecular generation\",\"authors\":\"Jerret Ross, Brian Belgodere, Samuel C. Hoffman, Vijil Chenthamarakshan, Jiri Navratil, Youssef Mroueh and Payel Das\",\"doi\":\"10.1039/D5DD00122F\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p >Transformer-based models trained on large and general purpose datasets consisting of molecular strings have recently emerged as a powerful tool for successfully modeling various structure–property relations. Inspired by this success, we extend the paradigm of training chemical language transformers on large-scale chemical datasets to generative tasks in this work. Specifically, we propose GP-MoLFormer, an autoregressive molecular string generator that is trained on more than 1.1b (billion) chemical SMILES. GP-MoLFormer uses a 46.8m parameter transformer decoder model with linear attention and rotary positional encodings as the base architecture. 
GP-MoLFormer's utility is evaluated and compared with that of existing baselines on three different tasks: <em>de novo</em> generation, scaffold-constrained molecular decoration, and unconstrained property-guided optimization. While the first two are handled with no additional training, we propose a parameter-efficient fine-tuning method for the last task, which uses property-ordered molecular pairs as input. We call this new approach pair-tuning. Our results show GP-MoLFormer performs better or comparable with baselines across all three tasks, while producing molecules with higher diversity demonstrating its general utility for a variety of molecular generation tasks. We further report strong memorization of training data in GP-MoLFormer generations, which has so far remained unexplored for chemical language models. Our analyses reveal that training data memorization and novelty in generations are impacted by the quality and scale of the training data; duplication bias in training data can enhance memorization at the cost of lowering novelty. We further establish a scaling law relating inference compute and novelty in generations, and show that the proposed model excels at yielding molecules containing unique scaffolds while generating at ≈10<small><sup>6</sup></small> to 10<small><sup>9</sup></small> scale.</p>\",\"PeriodicalId\":72816,\"journal\":{\"name\":\"Digital discovery\",\"volume\":\" 10\",\"pages\":\" 2684-2696\"},\"PeriodicalIF\":6.2000,\"publicationDate\":\"2025-08-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00122f?page=search\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Digital discovery\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://pubs.rsc.org/en/content/articlelanding/2025/dd/d5dd00122f\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"CHEMISTRY, MULTIDISCIPLINARY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Digital discovery","FirstCategoryId":"1085","ListUrlMain":"https://pubs.rsc.org/en/content/articlelanding/2025/dd/d5dd00122f","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"CHEMISTRY, MULTIDISCIPLINARY","Score":null,"Total":0}
Citations: 0

Abstract

Transformer-based models trained on large, general-purpose datasets of molecular strings have recently emerged as a powerful tool for modeling various structure–property relations. Inspired by this success, we extend the paradigm of training chemical language transformers on large-scale chemical datasets to generative tasks in this work. Specifically, we propose GP-MoLFormer, an autoregressive molecular string generator trained on more than 1.1 billion chemical SMILES. GP-MoLFormer uses a 46.8M-parameter transformer decoder with linear attention and rotary positional encodings as the base architecture. GP-MoLFormer's utility is evaluated and compared with that of existing baselines on three different tasks: de novo generation, scaffold-constrained molecular decoration, and unconstrained property-guided optimization. While the first two are handled with no additional training, we propose a parameter-efficient fine-tuning method for the last task, which uses property-ordered molecular pairs as input. We call this new approach pair-tuning. Our results show that GP-MoLFormer performs better than or comparably to baselines across all three tasks, while producing molecules with higher diversity, demonstrating its general utility for a variety of molecular generation tasks. We further report strong memorization of training data in GP-MoLFormer generations, which has so far remained unexplored for chemical language models. Our analyses reveal that training-data memorization and novelty in generations are affected by the quality and scale of the training data; duplication bias in training data can enhance memorization at the cost of lowering novelty. We further establish a scaling law relating inference compute and novelty in generations, and show that the proposed model excels at yielding molecules containing unique scaffolds while generating at ≈10⁶ to 10⁹ scale.
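To make the de novo generation setting concrete, below is a minimal, hypothetical sketch of autoregressive SMILES sampling with a small decoder-only transformer in PyTorch. It is not the authors' released model: it uses standard softmax attention and learned embeddings in place of GP-MoLFormer's linear attention and rotary positional encodings, and the class names, vocabulary handling, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch: autoregressive SMILES generation with a toy decoder-only
# transformer. Illustrative only -- standard softmax attention is used here,
# not GP-MoLFormer's linear attention with rotary positional encodings, and
# the SMILES tokenizer/vocabulary is assumed to exist elsewhere.
import torch
import torch.nn as nn


class TinySmilesLM(nn.Module):
    """A toy causal language model over SMILES token ids."""

    def __init__(self, vocab_size: int, d_model: int = 256,
                 n_layers: int = 4, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # Causal mask so each position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1)).to(ids.device)
        h = self.blocks(self.embed(ids), mask=mask)
        return self.lm_head(h)  # (batch, seq_len, vocab_size) logits


@torch.no_grad()
def sample_smiles_ids(model: nn.Module, bos_id: int, eos_id: int,
                      max_len: int = 128, temperature: float = 1.0) -> list[int]:
    """Sample one token-id sequence; decode to a SMILES string with your tokenizer."""
    model.eval()
    ids = torch.tensor([[bos_id]])
    for _ in range(max_len):
        logits = model(ids)[:, -1, :] / temperature
        next_id = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
        ids = torch.cat([ids, next_id], dim=1)
        if next_id.item() == eos_id:
            break
    return ids.squeeze(0).tolist()
```

Pair-tuning, as described in the abstract, instead fine-tunes a small number of parameters on property-ordered molecular pairs (presumably ordered from lower to higher property value), so that the same decoder can be steered toward property improvement; the exact pair construction and tuned parameters are detailed in the paper.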
