Evaluating Predictive Accuracy in Asymmetric Catalysis: A Machine Learning Perspective on Local Reaction Space

IF 13.1 · Region 1 (Chemistry) · JCR Q1 CHEMISTRY, PHYSICAL

ACS Catalysis · Pub Date: 2025-03-31 · DOI: 10.1021/acscatal.5c01051

Isaiah O. Betinol, Aleksandra Demchenko, and Jolene P. Reid*

ACS Catalysis 2025, 15 (8), 6067–6077. https://pubs.acs.org/doi/10.1021/acscatal.5c01051

Citations: 0

Abstract


Machine learning (ML) models are increasingly being employed in asymmetric catalysis to predict reaction outcomes and optimize enantioselective processes. Despite the trend of expanding data set sizes to improve model performance, asymmetric catalysis presents unique challenges, including the difficulty of acquiring high-quality experimental data and the often-limited availability of structurally diverse examples. Consequently, rational data set design requires the practitioner to choose whether to collect data that maximizes diversity in the training set or data that maximizes representation around a target prediction. A key challenge in these studies is understanding the role of local reaction space: specifically, how much predictive accuracy is driven by nearest neighbors (structurally and electronically similar data points) and the next-nearest neighbors? This study investigates the predictive power of ML models trained with varying levels of local representation in the reaction space. We provide a framework, a radius-based random forest (RaRF) algorithm, to systematically probe the effects of including diverse reactions dissimilar to a target prediction. We show that when the training set is representative of the target reaction, the gains from increasing data set diversity are modest (typically less than 0.1 kcal/mol in predictive error), increasing to only 0.5 kcal/mol for extrapolative tests, highlighting the need for targeted data set design. Furthermore, these findings hold even for complex architectures and features. Finally, we demonstrate that a targeted, neighborhood-oriented strategy greatly accelerates the identification of predictive models compared to diversity-driven approaches.
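The radius-based idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' RaRF implementation: the abstract gives no algorithmic details, so the Euclidean distance metric, the function names (`select_within_radius`, `predict_local`), the two-descriptor feature vectors, and the selectivity values below are all assumptions made up for illustration. To keep the sketch free of dependencies, a simple mean over the neighborhood stands in for the random forest; only the neighbor-selection step reflects the "radius-based" part.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_within_radius(train_X, train_y, target_x, radius):
    """Keep only training reactions whose descriptors lie within `radius` of the target."""
    return [(x, y) for x, y in zip(train_X, train_y)
            if euclidean(x, target_x) <= radius]

def predict_local(train_X, train_y, target_x, radius):
    """Predict from the local neighborhood only; return None if it is empty.

    Stand-in model (assumption): the mean of the neighbor labels,
    NOT a random forest.
    """
    neighbors = select_within_radius(train_X, train_y, target_x, radius)
    if not neighbors:
        return None
    return sum(y for _, y in neighbors) / len(neighbors)

# Hypothetical reactions: two made-up descriptors each, with made-up
# selectivities in kcal/mol. None of these numbers come from the paper.
train_X = [(0.0, 0.0), (0.1, 0.2), (0.3, 0.1), (2.0, 2.0), (3.0, 1.0)]
train_y = [1.0, 1.2, 1.1, 0.2, 2.5]
target = (0.1, 0.1)

# A small radius uses only close neighbors; a large one admits dissimilar reactions.
for radius in (0.5, 5.0):
    n_used = len(select_within_radius(train_X, train_y, target, radius))
    print(radius, n_used, predict_local(train_X, train_y, target, radius))
```

Sweeping `radius` from small to large is one way to probe how much dissimilar reactions change the prediction for a given target. In a fuller version, the selected subset would be passed to an actual random forest (for example scikit-learn's `RandomForestRegressor`) instead of being averaged.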


Source journal

ACS Catalysis (CHEMISTRY, PHYSICAL)

CiteScore: 20.80

Self-citation rate: 6.20%

Articles published: 1253

Review time: 1.5 months

Journal description: ACS Catalysis is an esteemed journal that publishes original research in the fields of heterogeneous catalysis, molecular catalysis, and biocatalysis. It offers broad coverage across diverse areas such as life sciences, organometallics and synthesis, photochemistry and electrochemistry, drug discovery and synthesis, materials science, environmental protection, polymer discovery and synthesis, and energy and fuels. The scope of the journal is to showcase innovative work in various aspects of catalysis. This includes new reactions and novel synthetic approaches utilizing known catalysts, the discovery or modification of new catalysts, elucidation of catalytic mechanisms through cutting-edge investigations, practical enhancements of existing processes, as well as conceptual advances in the field. Contributions to ACS Catalysis can encompass both experimental and theoretical research focused on catalytic molecules, macromolecules, and materials that exhibit catalytic turnover.