Improved QSAR methods for predicting drug properties utilizing topological indices and machine learning models

Impact factor 1.8; Zone 4 (Physics and Astrophysics); JCR Q4 (Chemistry, Physical)
Muhammad Shoaib Sardar, Muhammad Shahid Iqbal, Muhammad Mudassar Hassan, Changjiang Bu, Sharafat Hussain
The European Physical Journal E, volume 48 (4-5). Journal article, published 2025-05-09. DOI: 10.1140/epje/s10189-025-00491-6. URL: https://link.springer.com/article/10.1140/epje/s10189-025-00491-6

Abstract

This research uses quantitative structure–activity relationship (QSAR) analysis to predict physicochemical and topological properties of compounds, such as drug complexity (C), molecular weight (MW), and topological polar surface area (TPSA). Several machine learning models, including Linear Regression, Ridge Regression, Lasso Regression, Random Forest Regression, and Gradient Boosting, were developed to improve prediction accuracy using topological indices. Each compound in the datasets was paired with its corresponding topological indices. Model performance was evaluated using Mean Squared Error (MSE) and \(R^2\) after hyperparameter tuning via GridSearchCV. Ridge and Lasso Regression stood out, with the lowest average test MSE (3617.74 and 3540.23, respectively) and the highest \(R^2\) scores (0.9322 and 0.9374, respectively), which indicates that they handle multicollinearity well and resist overfitting. Linear Regression also performed robustly, with an MSE of 5249.97 and an \(R^2\) of 0.8563, highlighting the suitability of simpler models for datasets with inherent linear relationships. Random Forest and Gradient Boosting Regression can capture nonlinear relationships, but their performance varied: Random Forest Regression achieved an MSE of 6485.45 and an \(R^2\) of 0.6643, while Gradient Boosting initially performed poorly, with an MSE of 4488.04 and an \(R^2\) of 0.5659. After fine-tuning with an expanded hyperparameter grid, Gradient Boosting improved significantly, reaching a test MSE of 1494.74 and an \(R^2\) of 0.9171. It still ranked fourth, however, suggesting that simpler models such as Linear, Ridge, and Lasso Regression may be better suited to this dataset. This work underscores the importance of careful model selection and optimization in QSAR analysis, and shows how these approaches can be used to develop dependable predictive models in computational drug discovery and cheminformatics.
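
The evaluation protocol described in the abstract (grid search over hyperparameters, then test-set MSE and \(R^2\)) can be sketched roughly as follows. This is a minimal illustration with scikit-learn, not the authors' code: the synthetic data from `make_regression`, the hyperparameter grids, the 80/20 split, the 5-fold cross-validation, and the helper name `compare_models` are all assumptions, since the abstract does not state them.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a table of topological indices (X) and one target
# property (y). The paper's PubChem-derived dataset is NOT reproduced here.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical grids, chosen only for illustration.
CANDIDATES = {
    "Linear": (LinearRegression(), {"model__fit_intercept": [True, False]}),
    "Ridge": (Ridge(), {"model__alpha": [0.01, 0.1, 1.0, 10.0]}),
    "Lasso": (Lasso(max_iter=10000), {"model__alpha": [0.01, 0.1, 1.0, 10.0]}),
    "RandomForest": (
        RandomForestRegressor(random_state=0),
        {"model__n_estimators": [50, 100], "model__max_depth": [None, 5]},
    ),
    "GradientBoosting": (
        GradientBoostingRegressor(random_state=0),
        {"model__n_estimators": [100, 200], "model__learning_rate": [0.05, 0.1]},
    ),
}


def compare_models(X_train, X_test, y_train, y_test):
    """Tune each candidate with GridSearchCV and score it on the test split."""
    results = {}
    for name, (estimator, grid) in CANDIDATES.items():
        # Scaling lives inside the pipeline so each CV fold is scaled
        # using statistics from its own training portion only.
        pipe = Pipeline([("scale", StandardScaler()), ("model", estimator)])
        search = GridSearchCV(pipe, grid, scoring="neg_mean_squared_error", cv=5)
        search.fit(X_train, y_train)
        y_pred = search.predict(X_test)  # uses the refitted best estimator
        results[name] = {
            "best_params": search.best_params_,
            "test_mse": mean_squared_error(y_test, y_pred),
            "test_r2": r2_score(y_test, y_pred),
        }
    return results


results = compare_models(X_train, X_test, y_train, y_test)
# Rank models by test MSE, lowest first.
for name, r in sorted(results.items(), key=lambda kv: kv[1]["test_mse"]):
    print(f"{name:18s} MSE={r['test_mse']:10.2f}  R2={r['test_r2']:.4f}")
```

Because the synthetic target is linear in the features, the linear models will tend to rank first here. That ordering comes from the toy data and says nothing about the paper's reported results.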

Graphical abstract: A machine learning pipeline for predicting physicochemical and topological properties of chemical compounds using QSAR analysis. The process begins with compound data collection from PubChem, followed by data preprocessing, feature engineering, and feature selection. The selected features are used to train several regression models (Linear, Ridge, Lasso, Random Forest, and Gradient Boosting Regression), which are compared using MSE and \(R^2\).
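
To make the feature-engineering step concrete, the sketch below computes two widely used topological indices from a hydrogen-suppressed molecular graph. The abstract does not say which indices the authors used, so the first Zagreb index and the Wiener index are assumptions chosen for illustration, and the helper names (`adjacency`, `first_zagreb_index`, `wiener_index`) are hypothetical.

```python
from collections import deque


def adjacency(n_atoms, bonds):
    """Build an undirected adjacency map from a list of (atom, atom) bonds."""
    adj = {v: set() for v in range(n_atoms)}
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def first_zagreb_index(adj):
    """First Zagreb index: sum of squared vertex degrees."""
    return sum(len(neighbours) ** 2 for neighbours in adj.values())


def wiener_index(adj):
    """Wiener index: sum of shortest-path distances over all vertex pairs.

    Assumes the graph is connected; distances come from one BFS per vertex.
    """
    total = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
    return total // 2  # every unordered pair was counted twice


# Carbon skeletons only (hydrogens suppressed).
n_butane = adjacency(4, [(0, 1), (1, 2), (2, 3)])   # path of four carbons
isobutane = adjacency(4, [(0, 1), (0, 2), (0, 3)])  # one central carbon

features = {
    name: {"M1": first_zagreb_index(g), "W": wiener_index(g)}
    for name, g in [("n-butane", n_butane), ("isobutane", isobutane)]
}
print(features)
# {'n-butane': {'M1': 10, 'W': 10}, 'isobutane': {'M1': 12, 'W': 9}}
```

Each compound then contributes one row of index values to the feature table, which is the kind of input a regression model in such a pipeline would consume.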

Source journal: The European Physical Journal E
Categories: Chemistry, Physical; Materials Science, Multidisciplinary
CiteScore: 2.60
Self-citation rate: 5.60%
Articles published: 92
Review time: 3 months
Journal description: EPJ E publishes papers describing advances in the understanding of physical aspects of Soft, Liquid and Living Systems. Soft matter is a generic term for a large group of condensed, often heterogeneous systems (often also called complex fluids) that display a large response to weak external perturbations and that possess properties governed by slow internal dynamics. Flowing matter refers to all systems that can actually flow, from simple to multiphase liquids, from foams to granular matter. Living matter concerns the new physics that emerges from novel insights into the properties and behaviours of living systems. Furthermore, it aims at developing new concepts and quantitative approaches for the study of biological phenomena. Approaches from soft matter physics and statistical physics play a key role in this research. The journal includes reports of experimental, computational and theoretical studies and appeals to the broad interdisciplinary communities including physics, chemistry, biology, mathematics and materials science.