On the design of optimal computer experiments to model solvent effects on reaction kinetics

IF 3.2, Zone 3 (Engineering and Technology), Q2 CHEMISTRY, PHYSICAL

Molecular Systems Design & Engineering Pub Date : 2024-09-06 DOI:10.1039/D4ME00074A

Lingfeng Gui, Alan Armstrong, Amparo Galindo, Fareed Bhasha Sayyed, Stanley P. Kolis and Claire S. Adjiman

Volume: 12, Pages: 1254-1274, Open access: no
Full-text PDF: https://pubs.rsc.org/en/content/articlepdf/2024/me/d4me00074a?page=search
Article page: https://pubs.rsc.org/en/content/articlelanding/2024/me/d4me00074a

Citations: 0

Abstract

Developing an accurate predictive model of solvent effects on reaction kinetics is a challenging task, yet it can play an important role in process development. While first-principles or machine learning models are often compute- or data-intensive, simple surrogate models, such as multivariate linear or quadratic regression models, are useful when computational resources and data are scarce. The judicious choice of a small set of training data, i.e., a set of solvents in which quantum mechanical (QM) calculations of liquid-phase rate constants are to be performed, is critical to obtaining a reliable model. This is, however, made especially challenging by the highly irregular shape of the discrete space of possible experiments (solvent choices). In this work, we demonstrate that when choosing a set of computer experiments to generate training data, the D-optimality criterion value of the chosen set correlates well with the likelihood of achieving good model performance. With the Menshutkin reaction of pyridine and phenacyl bromide as a case study, this finding is further verified via the evaluation of the surrogate models regressed using D-optimal solvent sets generated from four distinct selection spaces. We also find that incorporating quadratic terms in the surrogate model and choosing the D-optimal solvent set from a selection space similar to the test set can significantly improve the accuracy of reaction rate constant predictions while using a small training dataset. Our approach holds promise for the use of statistical optimality criteria for other types of computer experiments, supporting the construction of surrogate models with reduced resource and data requirements.
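To make the D-optimality criterion concrete: for a regression model with model matrix F (one row per chosen solvent, one column per model term), a D-optimal set maximises det(FᵀF) over the discrete set of candidate solvents. The sketch below illustrates this with a generic point-exchange heuristic and random restarts. It is not the authors' algorithm; the descriptor values are synthetic, the quadratic model includes only pure squared terms, and all function names (`model_matrix`, `log_det_information`, `d_optimal_exchange`) are hypothetical.

```python
import numpy as np


def model_matrix(X, quadratic=False):
    """Intercept + linear terms, optionally with pure squared terms (assumed model form)."""
    X = np.asarray(X, dtype=float)
    cols = [np.ones((X.shape[0], 1)), X]
    if quadratic:
        cols.append(X ** 2)
    return np.hstack(cols)


def log_det_information(F):
    """Return log det(F^T F), or -inf if the information matrix is singular."""
    sign, logdet = np.linalg.slogdet(F.T @ F)
    return logdet if sign > 0 else -np.inf


def d_optimal_exchange(F, n_select, n_restarts=5, seed=0):
    """Point-exchange heuristic: swap one selected row at a time while log det improves.

    Returns a locally optimal index set; global optimality is not guaranteed.
    """
    rng = np.random.default_rng(seed)
    n = F.shape[0]
    best_idx, best_val = None, -np.inf
    for _ in range(n_restarts):
        idx = [int(i) for i in rng.choice(n, size=n_select, replace=False)]
        val = log_det_information(F[idx])
        improved = True
        while improved:
            improved = False
            for pos in range(n_select):
                for cand in range(n):
                    if cand in idx:
                        continue
                    trial = idx.copy()
                    trial[pos] = cand
                    trial_val = log_det_information(F[trial])
                    if trial_val > val + 1e-12:
                        idx, val, improved = trial, trial_val, True
        if best_idx is None or val > best_val:
            best_idx, best_val = sorted(idx), val
    return best_idx, best_val


# Synthetic stand-in for solvent descriptors: 30 candidate "solvents", 2 descriptors each.
descriptors = np.random.default_rng(42).normal(size=(30, 2))
F = model_matrix(descriptors, quadratic=True)  # 5 model terms
best_idx, best_val = d_optimal_exchange(F, n_select=8)
print("selected candidates:", best_idx)
print("log det(F^T F):", best_val)
```

Working with the log-determinant rather than the determinant itself avoids overflow or underflow when descriptors differ widely in scale. In practice the candidate rows would come from real solvent descriptors, and the selected solvents would be the ones in which the QM rate-constant calculations are run.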

Source journal

Molecular Systems Design & Engineering (Engineering: Biomedical Engineering)

CiteScore: 6.40

Self-citation rate: 2.80%

Articles published: 144

About the journal: Molecular Systems Design & Engineering provides a hub for cutting-edge research into how understanding of molecular properties, behaviour and interactions can be used to design and assemble better materials, systems, and processes to achieve specific functions. These may have applications of technological significance and help address global challenges.