Improved QSAR methods for predicting drug properties utilizing topological indices and machine learning models
Muhammad Shoaib Sardar, Muhammad Shahid Iqbal, Muhammad Mudassar Hassan, Changjiang Bu, Sharafat Hussain
The European Physical Journal E, vol. 48 (4-5). Published 2025-05-09. DOI: 10.1140/epje/s10189-025-00491-6
https://link.springer.com/article/10.1140/epje/s10189-025-00491-6
Impact factor: 1.8. JCR: Q4 (Chemistry, Physical).
Citation count: 0
Abstract
This research investigates the anticipated physicochemical and topological properties of compounds such as drug complexity (C), molecular weight (MW), and topological polar surface area (TPSA) using quantitative structure–activity relationship (QSAR) analysis. Several machine learning models, including Linear Regression, Ridge Regression, Lasso Regression, Random Forest Regression, and Gradient Boosting, were developed to improve prediction accuracy using topological indices. The datasets were combined with appropriate topological indices for individual compounds. Model performance was evaluated using Mean Squared Error (MSE) and \(R^2\) score after hyperparameter tuning via GridSearchCV. Ridge and Lasso Regression models stood out due to their lowest Test MSE averages (3617.74 and 3540.23, respectively) and highest \(R^2\) scores (0.9322 and 0.9374, respectively), demonstrating their effectiveness in handling multicollinearity and preventing overfitting. Linear Regression also performed robustly, achieving an MSE of 5249.97 and an \(R^2\) of 0.8563, highlighting the suitability of simpler models for datasets with inherent linear relationships. While Random Forest and Gradient Boosting Regression are capable of capturing nonlinear relationships, their performance varied. Random Forest Regression achieved an MSE of 6485.45 and an \(R^2\) of 0.6643, while Gradient Boosting initially performed poorly with an MSE of 4488.04 and an \(R^2\) of 0.5659. After fine-tuning Gradient Boosting with an expanded hyperparameter grid, its performance improved significantly, achieving a Test MSE of 1494.74 and an \(R^2\) of 0.9171. However, it still ranked fourth, suggesting that simpler models like Linear, Ridge, and Lasso Regression may be better suited for this dataset. 
This work emphasizes the significance of accurate model selection and optimization in QSAR analysis, demonstrating how these approaches can be used to develop dependable predictive models in computational drug discovery and cheminformatics.
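The tuning and evaluation procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the data are synthetic (generated with `make_regression` as a stand-in for a table of topological indices and one target property), and the hyperparameter grids, the 80/20 split, and the 5-fold cross-validation are all assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for topological-index features (X) and one target
# property such as TPSA (y). The paper's actual data are not reproduced here.
X, y = make_regression(n_samples=150, n_features=8, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical, deliberately small hyperparameter grids (assumptions).
candidates = {
    "Linear": (LinearRegression(), {"fit_intercept": [True, False]}),
    "Ridge": (Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}),
    "Lasso": (Lasso(max_iter=10_000), {"alpha": [0.01, 0.1, 1.0, 10.0]}),
    "RandomForest": (
        RandomForestRegressor(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [None, 5]},
    ),
    "GradientBoosting": (
        GradientBoostingRegressor(random_state=0),
        {"n_estimators": [100, 200], "learning_rate": [0.05, 0.1]},
    ),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # Cross-validated grid search on the training split only.
    search = GridSearchCV(estimator, grid, scoring="neg_mean_squared_error", cv=5)
    search.fit(X_train, y_train)
    # The refitted best estimator is scored once on the held-out test split.
    y_pred = search.predict(X_test)
    results[name] = {
        "best_params": search.best_params_,
        "test_mse": mean_squared_error(y_test, y_pred),
        "test_r2": r2_score(y_test, y_pred),
    }

# Rank models from lowest to highest test MSE.
for name, row in sorted(results.items(), key=lambda kv: kv[1]["test_mse"]):
    print(f"{name:18s} MSE={row['test_mse']:.2f}  R2={row['test_r2']:.4f}")
```

Because the synthetic target is linear in the features by construction, the linear models should score well here; the numbers this sketch prints say nothing about the results reported in the abstract.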
Graphical abstract: A machine learning pipeline for predicting physicochemical and topological properties of chemical compounds using QSAR analysis. The process begins with compound data collection from PubChem, followed by data preprocessing, feature engineering, and feature selection. The selected features are used to train various regression models, including Linear, Ridge, Lasso, Random Forest, and Gradient Boosting Regression, which are evaluated using MSE and \(R^2\) metrics for performance comparison.
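As an illustration of the feature-engineering step, the sketch below computes three widely used topological indices from a hydrogen-suppressed molecular graph given as an edge list. The abstract does not say which indices the authors used, so the choice of the first and second Zagreb indices and the Wiener index is an assumption, and the helper names (`adjacency`, `first_zagreb`, `second_zagreb`, `wiener`, `index_features`) are hypothetical.

```python
from collections import deque


def adjacency(edges):
    """Build an adjacency map from an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def first_zagreb(adj):
    """M1: sum of squared vertex degrees."""
    return sum(len(nbrs) ** 2 for nbrs in adj.values())


def second_zagreb(edges, adj):
    """M2: sum over edges of the product of the endpoint degrees."""
    return sum(len(adj[u]) * len(adj[v]) for u, v in edges)


def wiener(adj):
    """W: sum of shortest-path distances over all unordered vertex pairs.

    Assumes a connected graph; distances come from a breadth-first search
    started at every vertex.
    """
    total = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in adj[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total += sum(dist.values())
    return total // 2  # every pair was counted from both endpoints


def index_features(edges):
    """Return one feature row (assumed index set) for a molecular graph."""
    adj = adjacency(edges)
    return {
        "M1": first_zagreb(adj),
        "M2": second_zagreb(edges, adj),
        "W": wiener(adj),
    }


# Carbon skeletons only (hydrogens suppressed), atoms numbered arbitrarily.
butane = [(1, 2), (2, 3), (3, 4)]        # C-C-C-C chain
isobutane = [(1, 2), (1, 3), (1, 4)]     # central carbon with three methyls

print(index_features(butane))     # {'M1': 10, 'M2': 8, 'W': 10}
print(index_features(isobutane))  # {'M1': 12, 'M2': 9, 'W': 9}
```

The two isomers share a molecular formula but yield different index values, which is what makes such graph-derived numbers usable as regression features.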
About the journal:
EPJ E publishes papers describing advances in the understanding of physical aspects of Soft, Liquid and Living Systems.
Soft matter is a generic term for a large group of condensed, often heterogeneous systems -- often also called complex fluids -- that display a large response to weak external perturbations and that possess properties governed by slow internal dynamics.
Flowing matter refers to all systems that can actually flow, from simple to multiphase liquids, from foams to granular matter.
Living matter concerns the new physics that emerges from novel insights into the properties and behaviours of living systems. Furthermore, it aims at developing new concepts and quantitative approaches for the study of biological phenomena. Approaches from soft matter physics and statistical physics play a key role in this research.
The journal includes reports of experimental, computational and theoretical studies and appeals to the broad interdisciplinary communities including physics, chemistry, biology, mathematics and materials science.