Systematic evaluation of data preprocessing and model selection strategies for reliable pIC₅₀ prediction of acetylcholinesterase inhibitors.

IF 2.3 · Zone 3 (Environmental Science and Ecology) · JCR Q3 (CHEMISTRY, MULTIDISCIPLINARY)

SAR and QSAR in Environmental Research Pub Date : 2026-02-01 Epub Date: 2026-04-10 DOI:10.1080/1062936X.2026.2647204

E Delibaş, H I Güler

Pages: 185-204. Publication type: Journal Article. Open access: no. Link: https://doi.org/10.1080/1062936X.2026.2647204

Citations: 0

Abstract

Predicting acetylcholinesterase (AChE) inhibitory activity is important in drug discovery. This study evaluates molecular descriptor-based machine learning models for predicting AChE inhibitory activity expressed as pIC₅₀ values. The primary objective was to compare how different data preprocessing strategies affect prediction performance and model selection on challenging chemical datasets with low-correlation structure. Tree-based gradient boosting algorithms, namely CatBoost and XGBoost, were examined together with regression models that are more sensitive to data preparation, including Support Vector Regression and a Multilayer Perceptron, and model-specific data preparation pipelines were applied according to each model's structural assumptions. The target variable was stabilized through logarithmic transformation and winsorization of IC₅₀ values. Model performance was assessed using both a 70-15-15 train-validation-test split and a 10-fold cross-validation protocol. Furthermore, stacking-based ensemble learning strategies were explored to enhance generalization. The results indicate that predictive performance is constrained predominantly by intrinsic dataset characteristics rather than by the choice of algorithm. Optimized tree-based models achieved the highest accuracy, while stacking provided only marginal improvements over the best individual learners. To improve interpretability, a SHAP-based explainable artificial intelligence analysis was conducted; it highlights the contributions of biologically meaningful molecular descriptors and offers guidance for future studies addressing comparable biochemical modelling challenges.
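The target preparation and evaluation protocol described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' pipeline: the abstract does not state the IC₅₀ units, the winsorization limits, or the order of the two target transformations, so the nanomolar units, the 5th/95th percentile limits, the winsorize-then-log order, and all function names below are assumptions chosen for illustration.

```python
import math
import random


def pic50_from_nm(ic50_nm):
    """Convert an IC50 given in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    return 9.0 - math.log10(ic50_nm)


def winsorize(values, lower_q=0.05, upper_q=0.95):
    """Clip values to empirical quantiles (limits are assumed, not from the paper)."""
    ordered = sorted(values)
    n = len(ordered)
    low = ordered[int(lower_q * (n - 1))]
    high = ordered[int(upper_q * (n - 1))]
    return [min(max(v, low), high) for v in values]


def split_70_15_15(n_samples, seed=0):
    """Shuffle sample indices and cut them into train/validation/test parts."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_train = n_samples * 70 // 100
    n_val = n_samples * 15 // 100
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]


if __name__ == "__main__":
    # Synthetic IC50 values in nM, spread over several orders of magnitude.
    rng = random.Random(42)
    ic50_nm = [10 ** rng.uniform(0, 6) for _ in range(200)]

    # Assumed order: winsorize raw IC50 values, then take the log transform.
    y = [pic50_from_nm(v) for v in winsorize(ic50_nm)]

    train, val, test = split_70_15_15(len(y))
    print(len(train), len(val), len(test))
    print(sum(1 for _ in k_fold_indices(len(y), k=10)))
```

In a real study the winsorization limits would normally be estimated on the training portion only and then applied to the validation and test portions, so that information from held-out compounds does not leak into the target preparation.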


Source journal

SAR and QSAR in Environmental Research (Environmental Science: Toxicology)

CiteScore: 5.20

Self-citation rate: 20.00%

Articles published:

Review time: >24 weeks

Journal description: SAR and QSAR in Environmental Research is an international journal welcoming papers on the fundamental and practical aspects of the structure-activity and structure-property relationships in the fields of environmental science, agrochemistry, toxicology, pharmacology and applied chemistry. A unique aspect of the journal is the focus on emerging techniques for the building of SAR and QSAR models in these widely varying fields. The scope of the journal includes, but is not limited to, the topics of topological and physicochemical descriptors, mathematical, statistical and graphical methods for data analysis, computer methods and programs, original applications and comparative studies. In addition to primary scientific papers, the journal contains reviews of books and software and news of conferences. Special issues on topics of current and widespread interest to the SAR and QSAR community will be published from time to time.