Large property models: a new generative machine-learning formulation for molecules

IF 3.4; Zone 3 (Chemistry); JCR Q2 (Chemistry)
Tianfan Jin, Veerupaksh Singla, Hsuan-Hao Hsu and Brett M. Savoie
Faraday Discussions, volume 256, pages 104-119. Published 2024-09-27. DOI: 10.1039/D4FD00113C
Article page: https://pubs.rsc.org/en/content/articlelanding/2025/fd/d4fd00113c
PDF: https://pubs.rsc.org/en/content/articlepdf/2025/fd/d4fd00113c?page=search

Abstract

Generative models for the inverse design of molecules with particular properties have been heavily hyped, but have yet to demonstrate significant gains over machine-learning-augmented expert intuition. A major challenge of such models is their limited accuracy in predicting molecules with targeted properties in the data-scarce regime, which is the regime typical of the prized outliers that it is hoped inverse models will discover. For example, activity data for a drug target or stability data for a material may only number in the tens to hundreds of samples, which is insufficient to learn an accurate and reasonably general property-to-structure inverse mapping from scratch. We’ve hypothesized that the property-to-structure mapping becomes unique when a sufficient number of properties are supplied to the models during training. This hypothesis has several important corollaries if true. It would imply that data-scarce properties can be completely determined using a set of more accessible molecular properties. It would also imply that a generative model trained on multiple properties would exhibit an accuracy phase transition after achieving a sufficient size—a process analogous to what has been observed in the context of large language models. To interrogate these behaviors, we have built the first transformers trained on the property-to-molecular-graph task, which we dub “large property models” (LPMs). A key ingredient is supplementing these models during training with relatively basic but abundant chemical property data. The motivation for the large-property-model paradigm, the model architectures, and case studies are presented here.
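
The uniqueness hypothesis in the abstract can be illustrated with a toy calculation. The sketch below is not the authors' model or data: it uses five small molecules and three hand-counted descriptors (heavy-atom count, O-H hydrogen-bond donors, methyl groups), all chosen here purely for illustration, and the function names are hypothetical. It counts how many molecules remain indistinguishable as more properties are supplied.

```python
from collections import defaultdict

# Toy table (an assumption for illustration, not data from the paper).
# Each tuple holds hand-counted descriptors:
#   (heavy-atom count, O-H hydrogen-bond donors, methyl (CH3) groups)
MOLECULES = {
    "ethanol (CCO)": (3, 1, 1),
    "dimethyl ether (COC)": (3, 0, 2),
    "1-propanol (CCCO)": (4, 1, 1),
    "2-propanol (CC(C)O)": (4, 1, 2),
    "ethyl methyl ether (CCOC)": (4, 0, 2),
}


def ambiguity(n_props):
    """Largest number of molecules that share the same first n_props properties."""
    groups = defaultdict(list)
    for name, props in MOLECULES.items():
        groups[props[:n_props]].append(name)
    return max(len(members) for members in groups.values())


def ambiguity_curve():
    """Ambiguity when 0, 1, 2, and 3 properties are supplied."""
    return [ambiguity(n) for n in range(4)]


if __name__ == "__main__":
    for n, worst in enumerate(ambiguity_curve()):
        print(f"{n} properties -> up to {worst} candidate structure(s)")
```

In this toy set the ambiguity falls to one once all three descriptors are given, so the property vector identifies a single structure. Whether the same holds at the scale of real chemical space is the open question the paper investigates.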


Source journal: Faraday Discussions (Chemistry, Physical)
CiteScore: 4.90; self-citation rate: 0.00%; articles published: 259; review time: 2.8 months
About the journal: Discussion summary and research papers from discussion meetings that focus on rapidly developing areas of physical chemistry and its interfaces