Large property models: a new generative machine-learning formulation for molecules
Tianfan Jin, Veerupaksh Singla, Hsuan-Hao Hsu, Brett M Savoie
Faraday Discussions, published 2024-09-27. DOI: 10.1039/d4fd00113c
Abstract
Generative models for the inverse design of molecules with particular properties have been heavily hyped, but have yet to demonstrate significant gains over machine-learning-augmented expert intuition. A major challenge of such models is their limited accuracy in predicting molecules with targeted properties in the data-scarce regime, which is the regime typical of the prized outliers that it is hoped inverse models will discover. For example, activity data for a drug target or stability data for a material may only number in the tens to hundreds of samples, which is insufficient to learn an accurate and reasonably general property-to-structure inverse mapping from scratch. We have hypothesized that the property-to-structure mapping becomes unique when a sufficient number of properties are supplied to the models during training. If true, this hypothesis has several important corollaries. It would imply that data-scarce properties can be completely determined using a set of more accessible molecular properties. It would also imply that a generative model trained on multiple properties would exhibit an accuracy phase transition after achieving a sufficient size, a process analogous to what has been observed in the context of large language models. To interrogate these behaviors, we have built the first transformers trained on the property-to-molecular-graph task, which we dub "large property models" (LPMs). A key ingredient is supplementing these models during training with relatively basic but abundant chemical property data. The motivation for the large-property-model paradigm, the model architectures, and case studies are presented here.
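To make the property-to-molecular-graph formulation concrete, below is a minimal sketch of what such a model could look like. It is an illustrative assumption, not the authors' architecture: a standard PyTorch encoder-decoder transformer whose encoder ingests a vector of cheap, abundant scalar properties (one embedded "token" per property) and whose decoder autoregressively emits a molecular string representation (SMILES tokens stand in for the molecular graph). All names, dimensions, and the toy vocabulary are hypothetical.

```python
# Hypothetical LPM sketch: properties in, molecule tokens out.
# Not the paper's implementation; a minimal stand-in for the idea.
import torch
import torch.nn as nn

class PropertyToMoleculeTransformer(nn.Module):
    def __init__(self, n_properties=8, vocab_size=64, d_model=128,
                 nhead=4, num_layers=2, max_len=64):
        super().__init__()
        # Each scalar property becomes one encoder "token" embedding,
        # so the encoder input is a length-n_properties sequence.
        self.property_proj = nn.Linear(1, d_model)
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, properties, tgt_tokens):
        # properties: (batch, n_properties) scalar property values
        # tgt_tokens: (batch, seq_len) SMILES token ids (teacher forcing)
        src = self.property_proj(properties.unsqueeze(-1))  # (B, P, d)
        pos = torch.arange(tgt_tokens.size(1), device=tgt_tokens.device)
        tgt = self.token_emb(tgt_tokens) + self.pos_emb(pos)
        # Causal mask: each output position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(
            tgt_tokens.size(1)).to(tgt_tokens.device)
        h = self.transformer(src, tgt, tgt_mask=mask)
        return self.out(h)  # (B, seq_len, vocab_size) logits

# Toy usage: 8 easily computed properties per molecule (e.g. molecular
# weight, logP, ...) conditioning next-token prediction over SMILES.
model = PropertyToMoleculeTransformer()
props = torch.randn(4, 8)                # 4 molecules, 8 properties each
tokens = torch.randint(0, 64, (4, 20))   # tokenized SMILES, length 20
logits = model(props, tokens[:, :-1])
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 64),
                             tokens[:, 1:].reshape(-1))
loss.backward()
```

The design choice this sketch highlights is the one the abstract emphasizes: the conditioning signal is not a single scarce target property but a whole panel of basic, abundant properties, on the hypothesis that enough of them jointly pin down a unique structure.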