High performance, large chemical coverage or both: DanishQSAR and hierarchies of post-hoc ensemble models optimized for sensitivity, specificity or balanced accuracy.
Authors: N G Nikolov, E B Wedebye
Journal: SAR and QSAR in Environmental Research, pages 1-27 (Journal Article)
Published: 2025-06-04
DOI: 10.1080/1062936X.2025.2510964 (https://doi.org/10.1080/1062936X.2025.2510964)
Impact factor: 2.3; JCR: Q3 (Chemistry, Multidisciplinary)
Abstract
The trade-off between applicability domain size and prediction accuracy is a well-known phenomenon in QSAR. We have developed a modelling approach where multiple models with different applicability domain sizes and with different prediction accuracy are selected instead of a single best model. This approach is implemented in DanishQSAR, a new software for binary classification QSAR modelling, integrating descriptor calculation, descriptor selection, model development, validation and application. The various methods and options available in the software are automatically tested and efficiently combined during model development using a version of cross-validation-based grid search and post-hoc ensemble modelling. The resulting large and diverse pool of model candidates is then analysed to generate three hierarchies of models, optimized for sensitivity, specificity or balanced accuracy, respectively, for minimum to maximum coverage levels. When predicting a query compound, the system provides predictions from all models in the three hierarchies, at all coverage levels with user-defined steps, together with the individual model predictivity performances, producing a prediction profile rather than one prediction from a single model. Twenty data sets from the Danish (Q)SAR Database (https://qsar.food.dtu.dk) are used to demonstrate the performance. The developed binary classification models are highly accurate by cross- and external validation.
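The idea of selecting one model per coverage level and per optimization target can be illustrated with a minimal sketch. This is not DanishQSAR's actual algorithm or API: the `Candidate` class, the `build_hierarchy` function, the selection rule (best score among candidates that reach at least the requested coverage) and all numbers are hypothetical, chosen only to show the shape of the resulting hierarchies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A hypothetical model candidate with cross-validated statistics."""
    name: str
    coverage: float      # fraction of compounds inside the applicability domain
    sensitivity: float
    specificity: float

    @property
    def balanced_accuracy(self) -> float:
        return (self.sensitivity + self.specificity) / 2


def build_hierarchy(pool, metric, levels):
    """For each minimum coverage level, keep the candidate scoring best on `metric`.

    Assumed rule for illustration only: a candidate is eligible for a level
    if its coverage is at least that level.
    """
    hierarchy = {}
    for level in levels:
        eligible = [c for c in pool if c.coverage >= level]
        if eligible:
            hierarchy[level] = max(eligible, key=lambda c: getattr(c, metric))
    return hierarchy


# Illustrative, made-up candidates (not results from the paper).
pool = [
    Candidate("A", coverage=0.40, sensitivity=0.95, specificity=0.90),
    Candidate("B", coverage=0.70, sensitivity=0.85, specificity=0.92),
    Candidate("C", coverage=0.95, sensitivity=0.80, specificity=0.78),
]
levels = [0.25, 0.50, 0.75]  # user-defined coverage steps

# Three hierarchies, one per optimization target.
hierarchies = {
    metric: build_hierarchy(pool, metric, levels)
    for metric in ("sensitivity", "specificity", "balanced_accuracy")
}

for metric, hierarchy in hierarchies.items():
    for level, model in hierarchy.items():
        print(f"{metric:>17} @ coverage >= {level:.2f}: model {model.name}")
```

In this toy pool the narrow-domain model wins at low coverage requirements and the broad-domain model is the only option at high ones, which mirrors the trade-off between applicability domain size and accuracy described in the abstract. A prediction profile for a query compound would then report the output of every selected model alongside its statistics.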
About the journal:
SAR and QSAR in Environmental Research is an international journal welcoming papers on the fundamental and practical aspects of the structure-activity and structure-property relationships in the fields of environmental science, agrochemistry, toxicology, pharmacology and applied chemistry. A unique aspect of the journal is the focus on emerging techniques for the building of SAR and QSAR models in these widely varying fields. The scope of the journal includes, but is not limited to, the topics of topological and physicochemical descriptors, mathematical, statistical and graphical methods for data analysis, computer methods and programs, original applications and comparative studies. In addition to primary scientific papers, the journal contains reviews of books and software and news of conferences. Special issues on topics of current and widespread interest to the SAR and QSAR community will be published from time to time.