Combination of machine learning and COSMO-RS thermodynamic model in predicting solubility parameters of coformers in production of cocrystals for enhanced drug solubility
Authors: Wael A. Mahdi, Ahmad J. Obaidullah
Journal: Chemometrics and Intelligent Laboratory Systems, volume 253, Article 105219 (impact factor 3.7; JCR Q2, Automation & Control Systems)
Published: 2024-08-22
DOI: 10.1016/j.chemolab.2024.105219
URL: https://www.sciencedirect.com/science/article/pii/S016974392400159X
Citations: 0
Abstract
In this study, we develop predictive models for three target variables, denoted δ_d, δ_p, and δ_h, using a dataset with 86 features and 181 samples. The response variables, which are Hansen solubility parameters, were correlated to the input features via several machine learning techniques. The input features are molecular descriptors of coformers, calculated from the COSMO-RS thermodynamic model and a group contribution approach. The analysis includes outlier detection via Cook's distance, normalization with a min-max scaler, and feature selection through L1-based methods. Three regression models are employed: Gaussian Process Regression (GPR), Passive Aggressive Regression (PAR), and Polynomial Regression (PR), with hyperparameters optimized using Transient Search Optimization (TSO).
For δ_d, the PAR model outperforms the others with an R² of 0.885, RMSE of 0.607, MAE of 0.524, and a maximum error of 1.294. The GPR model performs slightly worse, with an R² of 0.872, RMSE of 0.816, MAE of 0.579, and a maximum error of 2.755. The PR model reaches an R² of 0.814, RMSE of 0.923, MAE of 0.597, and a maximum error of 2.814.
For δ_p, the GPR model performs best, with an R² of 0.821, RMSE of 1.693, MAE of 1.391, and a maximum error of 3.457. The PAR model reaches an R² of 0.740, RMSE of 2.025, MAE of 1.980, and a maximum error of 6.609. The PR model reaches an R² of 0.7, RMSE of 2.329, MAE of 2.02, and a maximum error of 6.366.
For δ_h, the GPR model again performs best, with an R² of 0.983, RMSE of 1.243, MAE of 1.005, and a maximum error of 2.577. The PAR model reaches an R² of 0.924, RMSE of 2.713, MAE of 2.416, and a maximum error of 6.307. The PR model reaches an R² of 0.927, RMSE of 2.757, MAE of 2.334, and a maximum error of 8.064.
These results show that the chosen models, tuned with TSO, predict all three Hansen solubility parameters accurately, which suggests they can be applied to similar predictive modeling tasks.
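The abstract names min-max normalization and four evaluation scores (R², RMSE, MAE, maximum error) without defining them. The following is a minimal sketch of their standard textbook definitions in plain Python. It is not the authors' code: the function names `min_max_scale` and `regression_metrics` and the sample values are hypothetical and chosen only for illustration.

```python
import math


def min_max_scale(values):
    """Rescale a list of numbers to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE, MAE and maximum absolute error as a dict."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)               # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    # R^2 is undefined when the targets are constant (ss_tot == 0).
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {
        "r2": r2,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
        "max_error": max(abs(r) for r in residuals),
    }


if __name__ == "__main__":
    # Hypothetical values, not taken from the paper's dataset.
    observed = [16.0, 17.5, 18.2, 19.1]
    predicted = [16.4, 17.1, 18.0, 19.9]
    print(min_max_scale(observed))
    print(regression_metrics(observed, predicted))
```

Because RMSE, MAE, and maximum error are in the units of the target while R² is relative to the target's variance, a model can show a high R² alongside a comparatively large RMSE when the target spans a wide range, which is worth keeping in mind when comparing the scores reported for δ_d, δ_p, and δ_h.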
About the journal:
Chemometrics and Intelligent Laboratory Systems publishes original research papers, short communications, reviews, tutorials and Original Software Publications reporting on development of novel statistical, mathematical, or computer techniques in Chemistry and related disciplines.
Chemometrics is the chemical discipline that uses mathematical and statistical methods to design or select optimal procedures and experiments, and to provide maximum chemical information by analysing chemical data.
The journal deals with the following topics:
1) Development of new statistical, mathematical and chemometrical methods for Chemistry and related fields (Environmental Chemistry, Biochemistry, Toxicology, System Biology, -Omics, etc.)
2) Novel applications of chemometrics to all branches of Chemistry and related fields (typical domains of interest are: process data analysis, experimental design, data mining, signal processing, supervised modelling, decision making, robust statistics, mixture analysis, multivariate calibration etc.) Routine applications of established chemometrical techniques will not be considered.
3) Development of new software that provides novel tools or truly advances the use of chemometrical methods.
4) Well characterized data sets to test performance for the new methods and software.
The journal complies with the International Committee of Medical Journal Editors' Uniform Requirements for Manuscripts.