Adaptive subspace Bayesian optimization over molecular descriptor libraries for data-efficient chemical design

Impact factor: 6.2 (JCR Q1, Chemistry, Multidisciplinary)
Farshud Sorourifar, Thomas Banker and Joel A. Paulson
Journal: Digital Discovery, volume 10, pages 2910-2926
Published: 2025-09-01
DOI: 10.1039/D5DD00188A
Article page: https://pubs.rsc.org/en/content/articlelanding/2025/dd/d5dd00188a
PDF: https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00188a?page=search
Citations: 0

Abstract

The discovery of molecules with optimal functional properties is a central challenge across diverse fields such as energy storage, catalysis, and chemical sensing. However, molecular property optimization (MPO) remains difficult due to the combinatorial size of chemical space and the cost of acquiring property labels via simulations or wet-lab experiments. Bayesian optimization (BO) offers a principled framework for sample-efficient discovery in such settings, but its effectiveness depends critically on the quality of the molecular representation used to train the underlying probabilistic surrogate model. Existing approaches based on fingerprints, graphs, SMILES strings, or learned embeddings often struggle in low-data regimes due to high dimensionality or poorly structured latent spaces. Here, we introduce Molecular Descriptors with Actively Identified Subspaces (MolDAIS), a flexible molecular BO framework that adaptively identifies task-relevant subspaces within large descriptor libraries. Leveraging the sparse axis-aligned subspace (SAAS) prior introduced in recent BO literature, MolDAIS constructs parsimonious Gaussian process surrogate models that focus on task-relevant features as new data is acquired. In addition to validating this approach for descriptor-based MPO, we introduce two novel screening variants, which significantly reduce computational cost while preserving predictive accuracy and physical interpretability. We demonstrate that MolDAIS consistently outperforms state-of-the-art MPO methods across a suite of benchmark and real-world tasks, including single- and multi-objective optimization. Our results show that MolDAIS can identify near-optimal candidates from chemical libraries with over 100 000 molecules using fewer than 100 property evaluations, highlighting its promise as a practical tool for data-scarce molecular discovery.
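
The role of the SAAS prior can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the half-Cauchy scale `alpha=0.1`, and the exact kernel parameterization are assumptions chosen for illustration. The sketch draws per-descriptor inverse squared lengthscales from a hierarchical half-Cauchy prior, which pushes most of them toward zero, and plugs them into an axis-aligned (ARD) RBF kernel so that descriptors with near-zero values barely affect the surrogate. A full treatment would infer these hyperparameters from observed property data (for example by MCMC) rather than sampling them from the prior.

```python
import numpy as np


def sample_saas_inverse_lengthscales(num_features, alpha=0.1, rng=None):
    """Draw one sample of ARD inverse squared lengthscales from a SAAS-style prior.

    alpha is an assumed global half-Cauchy scale, not a value from the paper.
    """
    rng = np.random.default_rng(rng)
    # Global shrinkage level: tau ~ HalfCauchy(alpha)
    tau = alpha * np.abs(rng.standard_cauchy())
    # Per-descriptor relevance: rho_d ~ HalfCauchy(tau); most draws are tiny,
    # while the heavy tail lets a few descriptors stay relevant.
    rho = tau * np.abs(rng.standard_cauchy(num_features))
    return tau, rho


def ard_rbf_kernel(X1, X2, rho, variance=1.0):
    """Axis-aligned RBF kernel: variance * exp(-0.5 * sum_d rho_d * (x_d - y_d)^2)."""
    rho = np.asarray(rho, dtype=float)
    diff = X1[:, None, :] - X2[None, :, :]
    sq_dist = np.einsum("ijd,d->ij", diff**2, rho)
    return variance * np.exp(-0.5 * sq_dist)


def top_features(rho, k):
    """Indices of the k descriptors with the largest inverse squared lengthscales."""
    return np.argsort(rho)[::-1][:k]
```

In this sketch, `top_features` stands in for the idea of reading off a task-relevant subspace: descriptors whose inverse squared lengthscale stays large are the ones the surrogate actually uses, which is also what makes the resulting model physically interpretable.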

