Intermediate knowledge enhanced the performance of the amide coupling yield prediction model

IF 7.6 | Zone 1, Chemistry | Q1 | CHEMISTRY, MULTIDISCIPLINARY

Chemical Science | Pub Date: 2025-06-05 | DOI: 10.1039/d5sc03364k

Chonghuan Zhang, Qianghua Lin, Chenxi Yang, Yaxian Kong, Zhunzhun Yu, Kuangbiao Liao

Citations: 0

Abstract

Amide coupling is an important reaction widely applied in medicinal chemistry. However, condition recommendation remains a challenging issue due to the broad condition space. Recently, accurate condition recommendation via machine learning has emerged as a novel and efficient method to find suitable conditions to achieve the desired transformations. Nonetheless, accurately predicting yields is challenging due to the complex relationships involved. Herein, we present our strategy to address this problem. Two steps were taken to ensure the quality of the dataset. First, we selected a diverse and representative set of substrates to capture a broad spectrum of substrate structures and reaction conditions using an unbiased machine-based sampling approach. Second, experiments were conducted using our in-house high-throughput experimentation (HTE) platform to minimize the influence of human factors. Additionally, we proposed an intermediate knowledge-embedded strategy to enhance the model's robustness. The performance of the model was first evaluated at three different levels—random split, partial substrate novelty, and full substrate novelty. All model metrics in these cases improved dramatically, achieving an R² of 0.89, MAE of 6.1%, and RMSE of 8.0% in the full substrate novelty test dataset. Moreover, the generalization of our strategy was assessed using external datasets from reported literature, delivering an R² of 0.71, MAE of 7%, and RMSE of 10%. Meanwhile, the model could recommend suitable conditions for some reactions to elevate the reaction yields. Besides, the model was able to identify which reaction in a reaction pair with a reactivity cliff had a higher yield. In summary, our research demonstrated the feasibility of achieving accurate yield predictions through the combination of HTE and embedding intermediate knowledge into the model. This approach also has the potential to facilitate other related machine learning tasks.
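The three evaluation levels and the reported metrics (R², MAE, RMSE) can be illustrated with a minimal Python sketch. The abstract does not define how the splits were built, so the definitions below are assumptions: "full substrate novelty" is taken to mean that both the acid and the amine are absent from training, and "partial substrate novelty" that exactly one of them is. The record fields `acid` and `amine` and both function names are hypothetical, chosen only for this illustration; a random split would simply shuffle the reactions without regard to substrate identity.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (R2, MAE, RMSE) for two equal-length sequences of yields in %."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(ss_res / n)
    r2 = 1.0 - ss_res / ss_tot  # undefined if all true yields are identical
    return r2, mae, rmse


def split_by_substrate_novelty(reactions, held_out_acids, held_out_amines):
    """Partition reactions into (train, partial_novelty, full_novelty).

    Assumed definitions, not taken from the paper:
      full novelty    -> both acid and amine are held out of training
      partial novelty -> exactly one of the two substrates is held out
      train           -> neither substrate is held out
    """
    train, partial, full = [], [], []
    for rxn in reactions:
        new_acid = rxn["acid"] in held_out_acids
        new_amine = rxn["amine"] in held_out_amines
        if new_acid and new_amine:
            full.append(rxn)
        elif new_acid or new_amine:
            partial.append(rxn)
        else:
            train.append(rxn)
    return train, partial, full


if __name__ == "__main__":
    # Toy data with placeholder substrate labels; yields are made up.
    toy = [
        {"acid": "A1", "amine": "B1", "yield": 72.0},
        {"acid": "A1", "amine": "B2", "yield": 35.0},
        {"acid": "A2", "amine": "B1", "yield": 58.0},
        {"acid": "A2", "amine": "B2", "yield": 12.0},
    ]
    train, partial, full = split_by_substrate_novelty(toy, {"A2"}, {"B2"})
    print(len(train), len(partial), len(full))
    print(regression_metrics([10, 20, 30], [12, 18, 33]))
```

Grouping by substrate identity rather than by reaction keeps a held-out acid or amine from leaking into training through another reaction, which is what makes the novelty levels stricter than a random split.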

Source journal

Chemical Science (CHEMISTRY, MULTIDISCIPLINARY)

CiteScore: 14.40

Self-citation rate: 4.80%

Articles published: 1352

Review time: 2.1 months

About the journal: Chemical Science is a journal that encompasses various disciplines within the chemical sciences. Its scope includes publishing ground-breaking research with significant implications for its respective field, as well as appealing to a wider audience in related areas. To be considered for publication, articles must showcase innovative and original advances in their field of study and be presented in a manner that is understandable to scientists from diverse backgrounds. However, the journal generally does not publish highly specialized research.