Adaptive subspace Bayesian optimization over molecular descriptor libraries for data-efficient chemical design

Impact factor: 6.2 · JCR Q1 (Chemistry, Multidisciplinary)

Digital discovery · Pub Date: 2025-09-01 · DOI: 10.1039/D5DD00188A

Farshud Sorourifar, Thomas Banker and Joel A. Paulson

Digital discovery 10, pages 2910-2926. Article page: https://pubs.rsc.org/en/content/articlelanding/2025/dd/d5dd00188a


Abstract

The discovery of molecules with optimal functional properties is a central challenge across diverse fields such as energy storage, catalysis, and chemical sensing. However, molecular property optimization (MPO) remains difficult due to the combinatorial size of chemical space and the cost of acquiring property labels via simulations or wet-lab experiments. Bayesian optimization (BO) offers a principled framework for sample-efficient discovery in such settings, but its effectiveness depends critically on the quality of the molecular representation used to train the underlying probabilistic surrogate model. Existing approaches based on fingerprints, graphs, SMILES strings, or learned embeddings often struggle in low-data regimes due to high dimensionality or poorly structured latent spaces. Here, we introduce Molecular Descriptors with Actively Identified Subspaces (MolDAIS), a flexible molecular BO framework that adaptively identifies task-relevant subspaces within large descriptor libraries. Leveraging the sparse axis-aligned subspace (SAAS) prior introduced in recent BO literature, MolDAIS constructs parsimonious Gaussian process surrogate models that focus on task-relevant features as new data is acquired. In addition to validating this approach for descriptor-based MPO, we introduce two novel screening variants, which significantly reduce computational cost while preserving predictive accuracy and physical interpretability. We demonstrate that MolDAIS consistently outperforms state-of-the-art MPO methods across a suite of benchmark and real-world tasks, including single- and multi-objective optimization. Our results show that MolDAIS can identify near-optimal candidates from chemical libraries with over 100 000 molecules using fewer than 100 property evaluations, highlighting its promise as a practical tool for data-scarce molecular discovery.
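The loop the abstract describes (fit a surrogate on a small, task-relevant subset of descriptors, pick the next molecule from a fixed library, and re-identify the subset as labels arrive) can be illustrated with a simplified sketch. This is not the authors' implementation. In particular, the SAAS prior is a fully Bayesian treatment of the Gaussian process lengthscales, whereas the sketch below swaps in a much cruder stand-in: it ranks descriptors by absolute Pearson correlation with the observed property and fits a fixed-lengthscale Gaussian process on the top `k`. The function names (`screen_descriptors`, `subspace_bo`), the upper-confidence-bound acquisition, the synthetic library, and all parameter values are assumptions made for illustration only.

```python
import numpy as np


def rbf_kernel(A, B, lengthscale=1.0):
    # Squared-exponential kernel between the rows of A and the rows of B.
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)


def gp_posterior(X_train, y_train, X_test, noise=1e-3):
    # Exact zero-mean GP regression on standardized targets.
    mu_y, sd_y = y_train.mean(), y_train.std() + 1e-12
    y = (y_train - mu_y) / sd_y
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    Ks = rbf_kernel(X_train, X_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    v = np.linalg.solve(L, Ks)
    mean = Ks.T @ alpha
    var = np.maximum(1.0 - (v**2).sum(0), 1e-12)
    return mean * sd_y + mu_y, np.sqrt(var) * sd_y


def screen_descriptors(X, y, k):
    # ASSUMPTION: correlation ranking is a stand-in for subspace
    # identification; it is neither the SAAS prior nor the paper's screening.
    Xc = X - X.mean(0)
    yc = y - y.mean()
    denom = np.sqrt((Xc**2).sum(0) * (yc**2).sum()) + 1e-12
    score = np.abs(Xc.T @ yc) / denom
    return np.argsort(score)[::-1][:k]


def subspace_bo(descriptors, oracle, n_init=10, budget=40, k=5, beta=2.0, seed=0):
    # Bayesian optimization over a finite library (maximization).
    rng = np.random.default_rng(seed)
    Z = (descriptors - descriptors.mean(0)) / (descriptors.std(0) + 1e-12)
    evaluated = [int(i) for i in rng.choice(len(Z), size=n_init, replace=False)]
    y = [oracle(i) for i in evaluated]
    while len(evaluated) < budget:
        y_arr = np.array(y)
        # Re-identify the active descriptors every round from all labels so far.
        active = screen_descriptors(Z[evaluated], y_arr, k)
        mean, std = gp_posterior(Z[evaluated][:, active], y_arr, Z[:, active])
        ucb = mean + beta * std          # upper confidence bound acquisition
        ucb[evaluated] = -np.inf         # never re-query a labelled molecule
        nxt = int(np.argmax(ucb))
        evaluated.append(nxt)
        y.append(oracle(nxt))
    return evaluated, np.array(y)


# Hypothetical synthetic library: 2000 "molecules", 50 descriptors,
# with a property that depends on only three of them.
_rng = np.random.default_rng(1)
library = _rng.normal(size=(2000, 50))
true_property = 2.0 * library[:, 3] - library[:, 17] + np.sin(library[:, 8])

picked, values = subspace_bo(library, lambda i: float(true_property[i]))
print("best value found:", values.max(), "library optimum:", true_property.max())
```

In a real setting the rows of `library` would be computed descriptors for each candidate molecule and the oracle would be a simulation or experiment. A faithful version would replace `screen_descriptors` and the fixed-lengthscale kernel with a Gaussian process whose lengthscales carry a sparsity-inducing prior.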


Source journal

Digital discovery

CiteScore

2.80

Self-citation rate

0.00%

Number of articles published