Title: Systematic evaluation of data preprocessing and model selection strategies for reliable pIC50 prediction of acetylcholinesterase inhibitors
Authors: E Delibaş, H I Güler
Journal: SAR and QSAR in Environmental Research, pp. 185-204
Publication type: Journal Article
Publication date: 2026-02-01 (Epub 2026/4/10)
Impact factor: 2.3 (JCR Q3, Chemistry, Multidisciplinary)
Open access: no
DOI: https://doi.org/10.1080/1062936X.2026.2647204
Citations: 0
Abstract
Predicting acetylcholinesterase (AChE) inhibitory activity is important in drug discovery. This study evaluates molecular descriptor-based machine learning models for predicting AChE activity as pIC50 values. The primary objective was to compare how different data preprocessing strategies affect prediction performance and model selection on challenging chemical datasets with weak correlation structure. Tree-based gradient boosting algorithms, namely CatBoost and XGBoost, were examined together with sensitive regression models, including Support Vector Regression and Multilayer Perceptron, and model-specific data preparation pipelines were applied according to each model's structural assumptions. The target variable was stabilized through logarithmic transformation and winsorization of IC50 values. Model performance was assessed using both a 70-15-15 train-validation-test split and a 10-fold cross-validation protocol. Furthermore, stacking-based ensemble learning strategies were explored to enhance generalization capability. The results demonstrate that predictive performance is predominantly constrained by intrinsic dataset characteristics rather than by algorithm selection. Optimized tree-based models achieved the highest accuracy, while stacking provided only marginal improvements over the best individual learners. To improve interpretability, a SHAP-based explainable artificial intelligence analysis was conducted; it highlights the contributions of biologically meaningful molecular descriptors and offers guidance for future studies addressing comparable biochemical modelling challenges.
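The target preparation and the 70-15-15 split mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not state the IC50 units, the winsorization limits, or the order of the two steps. The sketch assumes IC50 in nanomolar, clipping at hypothetical percentile limits before the log transform, and a simple random split. The function names and the sample values are illustrative only.

```python
import math
import random


def winsorize(values, lower_pct=1.0, upper_pct=99.0):
    """Clip values to the given percentiles (nearest-rank method).

    The 1st/99th percentile defaults are an assumption, not from the paper.
    """
    ordered = sorted(values)
    n = len(ordered)

    def nearest_rank(pct):
        # Smallest observed value with at least pct% of the data at or below it.
        rank = max(1, math.ceil(pct * n / 100.0))
        return ordered[rank - 1]

    low, high = nearest_rank(lower_pct), nearest_rank(upper_pct)
    return [min(max(v, low), high) for v in values]


def ic50_nm_to_pic50(ic50_nm):
    """pIC50 = -log10(IC50 in mol/L) = 9 - log10(IC50 in nM)."""
    return 9.0 - math.log10(ic50_nm)


def split_70_15_15(n_samples, seed=0):
    """Return shuffled index lists for the train, validation, and test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = n_samples * 70 // 100
    n_val = n_samples * 15 // 100
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])


# Made-up IC50 values in nM, for illustration only.
ic50_nm = [0.5, 12.0, 340.0, 8100.0, 2.5e6]
pic50 = [ic50_nm_to_pic50(v) for v in winsorize(ic50_nm, 5.0, 95.0)]
train_idx, val_idx, test_idx = split_70_15_15(len(pic50))
```

Because the logarithm is monotonic, clipping at observed rank-based limits before or after the transform yields the same targets; with interpolated percentiles the two orders can differ slightly.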
About the journal:
SAR and QSAR in Environmental Research is an international journal welcoming papers on the fundamental and practical aspects of the structure-activity and structure-property relationships in the fields of environmental science, agrochemistry, toxicology, pharmacology and applied chemistry. A unique aspect of the journal is the focus on emerging techniques for the building of SAR and QSAR models in these widely varying fields. The scope of the journal includes, but is not limited to, the topics of topological and physicochemical descriptors, mathematical, statistical and graphical methods for data analysis, computer methods and programs, original applications and comparative studies. In addition to primary scientific papers, the journal contains reviews of books and software and news of conferences. Special issues on topics of current and widespread interest to the SAR and QSAR community will be published from time to time.