Toward Predicting Solubility of Arbitrary Solutes in Arbitrary Solvents. 1: Prediction of Density and Refractive Index Using Machine Learning Algorithms with correlation-group parallel feature analysis

Pub Date: 2023-11-01 DOI: 10.26434/chemrxiv-2023-5dgm9
Brian Hu, Jingchen Zhai, Xiguang Qi, Xibing He, Junmei Wang

Abstract

Density and refractive index (nD) are two important properties related to the van der Waals energy of a molecule. Accurate prediction of these two properties is therefore valuable both for molecular mechanics force field development and for predicting the solvation free energy and solubility of arbitrary molecules. In this study, we gathered molecular characteristics for roughly 5,000 organic compounds with density records and 4,000 organic compounds with nD values. We then generated GAFF (General AMBER Force Field) descriptors and RDKit descriptors for the compounds and used them to train prediction models for each property with a variety of machine learning algorithms. Both GAFF and RDKit descriptors yielded robust models with low average percent errors (APE), low root-mean-square errors (RMSE), and high R-square correlation coefficients, with RDKit descriptors performing slightly better for both properties. We further optimized the top models and conducted parallel feature analysis (PFA) to identify the specific features in each descriptor set that contributed most to model robustness. The final models achieve an RMSE of 0.071 g/cm³ for density and 0.014 for nD, an APE of 2.845% for density and 0.531% for nD, and an R-square of 0.950 for density and 0.954 for nD. Our prediction models for both density and nD significantly outperform those in all currently published studies, especially studies whose datasets contain more than 200 records. The successful prediction of these two key molecular properties paves the way toward accurately predicting the solubility of an arbitrary solute in an arbitrary solvent, an endeavor that not only helps the pharmaceutical industry develop better drug candidates but also makes overall wet lab work more efficient. Key predictors that contribute most to a specific model or model function were identified using both Shapley analysis and correlation-group parallel feature analysis (CG-PFA).
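The three error measures quoted in the abstract can be made concrete with a short sketch. The abstract does not give their formulas, so the definitions below are assumptions: RMSE is the usual root-mean-square error, APE is taken as the mean of |predicted - measured| / |measured| expressed in percent, and R-square is taken as the squared Pearson correlation between measured and predicted values. The function names and the sample numbers are hypothetical and are not data from the study.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between measured and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def ape(y_true, y_pred):
    """Average percent error (assumed definition): mean relative error x 100."""
    n = len(y_true)
    return 100.0 * sum(abs(p - t) / abs(t) for t, p in zip(y_true, y_pred)) / n


def r_squared(y_true, y_pred):
    """Squared Pearson correlation (assumed meaning of 'R-square')."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov ** 2 / (var_t * var_p)


# Made-up densities in g/cm^3, for illustration only.
measured = [0.80, 1.00, 1.20, 1.50]
predicted = [0.82, 0.97, 1.25, 1.46]

print(f"RMSE = {rmse(measured, predicted):.3f} g/cm^3")
print(f"APE  = {ape(measured, predicted):.3f} %")
print(f"R^2  = {r_squared(measured, predicted):.3f}")
```

Under these definitions RMSE carries the units of the property (g/cm³ for density, dimensionless for nD), while APE and R-square are unitless, which matches how the abstract reports the three values.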