Machine learning enables the prediction of amide bond synthesis based on small datasets

IF 13.5 · Zone 2 (Chemistry) · JCR Q1 (Chemistry, Physical)

Acta Physico-Chimica Sinica (物理化学学报) · Pub Date: 2025-02-01 · DOI: 10.3866/PKU.WHXB202309041

Xinghai Li, Zhisen Wu, Lijing Zhang, Shengyang Tao

Volume 41, Issue 2, Article 100010. Full text: https://www.sciencedirect.com/science/article/pii/S1000681824000109

Citations: 0

Abstract

Machine learning (ML) is showing clear advantages in chemical synthesis. However, traditional experimental methods generate data slowly, and this bottleneck impedes the wider adoption of ML. Models built on literature data often yield overly optimistic predictions, and collecting thousands of experimental data points remains a substantial challenge. Using a small experimental dataset, we show that ML algorithms can reliably predict the conversion rate of amide bond synthesis. We collected hundreds of data points for 9 aromatic amines and 12 organic acids, with various coupling reagents and solvents, in a 96-well-plate high-throughput experimental setup. We then derived 76 molecular descriptors from quantum chemical calculations and used them as input features for training the ML model. Despite the low data volume, the random forest algorithm showed outstanding predictive performance (R² > 0.95). Using feature importance analysis, Shapley additive explanations (SHAP), and accumulated local effects (ALE), we analyzed the reaction process and examined the important factors influencing the conversion rate. When predicting the conversion rate for previously unseen aromatic amines, we found that adding a small amount of reaction data for the unseen molecule to the training set effectively improves the model's predictive performance, even with a small dataset. By comparing models trained on different molecular representations, such as density functional theory (DFT) descriptors and one-hot encoding, we validated the efficacy of adjusting the training set to improve prediction results. This study used a large set of chemically meaningful descriptors and achieved more effective predictions through multidimensional data analysis, offering valuable insights for ML-assisted chemical synthesis research with small datasets. In the near future, machine learning is poised to drive the intelligent development of organic chemistry.
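The training-set adjustment described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the compound labels (`amine_1`, `acid_1`, ...), the helper names, and the choice of two few-shot reactions are all hypothetical, and coupling reagents and solvents are left out for brevity. The sketch shows one-hot encoding of the two reaction partners, a split that holds out one amine and then returns a few of its reactions to the training set, and the R² metric used to score predictions.

```python
import random
from itertools import product

# Hypothetical placeholder labels; the abstract does not name the compounds.
AMINES = [f"amine_{i}" for i in range(1, 10)]   # 9 aromatic amines
ACIDS = [f"acid_{i}" for i in range(1, 13)]     # 12 organic acids


def one_hot(label, vocabulary):
    """Return a 0/1 vector with a single 1 at the position of `label`."""
    return [1 if item == label else 0 for item in vocabulary]


def encode_reaction(amine, acid):
    """Concatenate the one-hot vectors of the two reaction partners."""
    return one_hot(amine, AMINES) + one_hot(acid, ACIDS)


def split_with_few_shot(reactions, held_out_amine, n_few_shot, seed=0):
    """Hold out every reaction of one amine, then move `n_few_shot`
    randomly chosen ones back into the training set."""
    train = [r for r in reactions if r[0] != held_out_amine]
    held_out = [r for r in reactions if r[0] == held_out_amine]
    rng = random.Random(seed)
    rng.shuffle(held_out)
    return train + held_out[:n_few_shot], held_out[n_few_shot:]


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# All amine/acid pairs (assumed full grid; the real dataset may be sparser).
reactions = list(product(AMINES, ACIDS))
train, test = split_with_few_shot(reactions, "amine_3", n_few_shot=2)
```

Comparing a model fitted on `train` with `n_few_shot=0` against one fitted with a small positive value mirrors the comparison the abstract reports. Swapping `encode_reaction` for a function that returns quantum chemical descriptors would give the DFT-based variant.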


