V. Diana Rakotonirina, Marco Bragato, Guido Falk von Rudorff, and O. Anatole von Lilienfeld*
Journal of Chemical Theory and Computation, 21 (19), 9844–9852. DOI: 10.1021/acs.jctc.5c00848. Published 2025-09-30.
Hammett-Inspired Product Baseline for Data-Efficient Δ-ML in Chemical Space
Data-hungry machine learning methods have become a new standard for efficiently navigating chemical compound space in molecular and materials design and discovery. High-quality experimental or simulated training data, however, are severely scarce, and data-acquisition costs can be considerable. Relying on reasonably accurate, computationally cheap legacy baseline labels is one of the most effective strategies to curb data needs, e.g., through Δ-, transfer-, or multifidelity learning. A surprisingly effective and data-efficient baseline model is presented in the form of a generic coarse-graining Hammett-inspired product (HIP) Ansatz, generalizing the empirical Hammett equation to arbitrary systems and properties. Numerical evidence for the applicability of HIP includes solvation free energies of molecules, formation energies of quaternary elpasolite crystals, carbon adsorption energies on heterogeneous catalytic surfaces, HOMO–LUMO gaps of metallorganic complexes, activation energies for SN2 reactions, and catalyst–substrate binding energies in cross-coupling reactions. After calibration on the same training sets, HIP yields an effective baseline for improved Δ-machine-learning models with superior data efficiency compared to previously introduced specialized domain-specific models.
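To make the abstract's central idea concrete, here is a minimal, hypothetical sketch of a product baseline for Δ-ML. It is not the authors' HIP implementation; it only illustrates the generic principle: in the spirit of the Hammett equation log10(K/K0) = ρσ, assume a property p(i, j) of a system built from two fragments factorizes as p(i, j) ≈ ρ_i · σ_j, obtain the best such rank-1 factors from the leading singular vectors of the property table, and treat the residual as the (smaller, easier-to-learn) target for a subsequent Δ-ML model.

```python
import numpy as np

# Hypothetical example data: a table P of properties indexed by two
# fragment types, generated here as a rank-1 product plus small noise.
rng = np.random.default_rng(0)
n_i, n_j = 6, 5
rho_true = rng.normal(size=n_i)     # per-fragment "sensitivity" factors
sigma_true = rng.normal(size=n_j)   # per-fragment "substituent" factors
P = np.outer(rho_true, sigma_true) + 0.01 * rng.normal(size=(n_i, n_j))

# The best rank-1 product approximation in the least-squares sense comes
# from the leading singular triplet of P (Eckart-Young theorem).
U, s, Vt = np.linalg.svd(P)
rho = U[:, 0] * np.sqrt(s[0])
sigma = Vt[0, :] * np.sqrt(s[0])
baseline = np.outer(rho, sigma)

# The residual P - baseline is what a Delta-ML model would be trained on;
# if the product Ansatz holds, it is much smaller than P itself.
residual = P - baseline
```

Any standard regressor (e.g., kernel ridge regression) would then be fit to `residual` rather than to `P` directly, which is the data-efficiency mechanism the abstract refers to.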
Journal description:
The Journal of Chemical Theory and Computation invites new and original contributions with the understanding that, if accepted, they will not be published elsewhere. Papers reporting new theories, methodology, and/or important applications in quantum electronic structure, molecular dynamics, and statistical mechanics are appropriate for submission to this Journal. Specific topics include advances in or applications of ab initio quantum mechanics, density functional theory, design and properties of new materials, surface science, Monte Carlo simulations, solvation models, QM/MM calculations, biomolecular structure prediction, and molecular dynamics in the broadest sense including gas-phase dynamics, ab initio dynamics, biomolecular dynamics, and protein folding. The Journal does not consider papers that are straightforward applications of known methods including DFT and molecular dynamics. The Journal favors submissions that include advances in theory or methodology with applications to compelling problems.