High performance, large chemical coverage or both: DanishQSAR and hierarchies of post-hoc ensemble models optimized for sensitivity, specificity or balanced accuracy.

IF 2.3 · Zone 3, Environmental Science and Ecology · JCR Q3, Chemistry, Multidisciplinary
N G Nikolov, E B Wedebye
SAR and QSAR in Environmental Research, pp. 1-27. Published 2025-06-04. DOI: 10.1080/1062936X.2025.2510964 (https://doi.org/10.1080/1062936X.2025.2510964)
Citations: 0

Abstract

The trade-off between applicability domain size and prediction accuracy is a well-known phenomenon in QSAR. We have developed a modelling approach in which multiple models with different applicability domain sizes and different prediction accuracies are selected instead of a single best model. This approach is implemented in DanishQSAR, a new software package for binary classification QSAR modelling that integrates descriptor calculation, descriptor selection, model development, validation and application. The various methods and options available in the software are automatically tested and efficiently combined during model development using a version of cross-validation-based grid search and post-hoc ensemble modelling. The resulting large and diverse pool of model candidates is then analysed to generate three hierarchies of models, optimized for sensitivity, specificity or balanced accuracy, respectively, from minimum to maximum coverage levels. When predicting a query compound, the system provides predictions from all models in the three hierarchies, at all coverage levels with user-defined steps, together with each model's predictive performance, producing a prediction profile rather than one prediction from a single model. Twenty data sets from the Danish (Q)SAR Database (https://qsar.food.dtu.dk) are used to demonstrate the performance. The developed binary classification models are highly accurate in both cross-validation and external validation.
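The abstract does not state the exact rule used to pick models for each hierarchy, so the following is only a minimal sketch of the general idea, under stated assumptions: each candidate in the pool carries a cross-validated coverage, sensitivity and specificity, and for every coverage level the candidate with the best target metric among those reaching that coverage is kept. The `Candidate` class, the `build_hierarchy` and `build_all` functions, and all numbers are hypothetical and are not taken from DanishQSAR.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Candidate:
    """One model candidate with cross-validated statistics (hypothetical)."""
    name: str
    coverage: float      # fraction of compounds inside the applicability domain
    sensitivity: float   # TP / (TP + FN)
    specificity: float   # TN / (TN + FP)

    @property
    def balanced_accuracy(self) -> float:
        # Balanced accuracy is the mean of sensitivity and specificity.
        return (self.sensitivity + self.specificity) / 2


# The three optimization targets named in the abstract.
METRICS: Dict[str, Callable[[Candidate], float]] = {
    "sensitivity": lambda m: m.sensitivity,
    "specificity": lambda m: m.specificity,
    "balanced_accuracy": lambda m: m.balanced_accuracy,
}


def build_hierarchy(
    pool: List[Candidate], metric: str, levels: List[float]
) -> Dict[float, Optional[Candidate]]:
    """For each coverage level, keep the best-scoring candidate that reaches it.

    Assumed selection rule: a candidate is eligible for a level if its
    coverage is at least that level; None marks a level nobody reaches.
    """
    score = METRICS[metric]
    hierarchy: Dict[float, Optional[Candidate]] = {}
    for level in levels:
        eligible = [m for m in pool if m.coverage >= level]
        hierarchy[level] = max(eligible, key=score) if eligible else None
    return hierarchy


def build_all(
    pool: List[Candidate], levels: List[float]
) -> Dict[str, Dict[float, Optional[Candidate]]]:
    """Build one hierarchy per target metric."""
    return {metric: build_hierarchy(pool, metric, levels) for metric in METRICS}


# Made-up candidates illustrating the coverage/accuracy trade-off.
pool = [
    Candidate("A", coverage=0.30, sensitivity=0.95, specificity=0.90),
    Candidate("B", coverage=0.60, sensitivity=0.85, specificity=0.92),
    Candidate("C", coverage=0.90, sensitivity=0.75, specificity=0.80),
]
levels = [0.25, 0.50, 0.75]  # user-defined coverage steps (illustrative)
hierarchies = build_all(pool, levels)

for metric, hierarchy in hierarchies.items():
    picks = {level: (m.name if m else None) for level, m in hierarchy.items()}
    print(metric, picks)
```

In this toy pool the narrow-domain model A wins at the lowest coverage level for sensitivity and balanced accuracy, while the broad-domain model C is the only option at the highest level. Reporting the selected model at every level for each metric, rather than a single winner, corresponds to the prediction profile the abstract describes.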

Source journal: SAR and QSAR in Environmental Research
CiteScore: 5.20
Self-citation rate: 20.00%
Articles published: 78
Review time: >24 weeks
Journal description: SAR and QSAR in Environmental Research is an international journal welcoming papers on the fundamental and practical aspects of structure-activity and structure-property relationships in the fields of environmental science, agrochemistry, toxicology, pharmacology and applied chemistry. A unique aspect of the journal is its focus on emerging techniques for building SAR and QSAR models in these widely varying fields. The scope of the journal includes, but is not limited to, topological and physicochemical descriptors; mathematical, statistical and graphical methods for data analysis; computer methods and programs; original applications; and comparative studies. In addition to primary scientific papers, the journal contains reviews of books and software and news of conferences. Special issues on topics of current and widespread interest to the SAR and QSAR community are published from time to time.