Data Exploration for Target Predictions Using Proprietary and Publicly Available Data Sets
Aljoša Smajić, Thomas Steger-Hartmann, Gerhard F Ecker, Anke Hackl
Chemical Research in Toxicology, pp. 820-833. Published 2025-05-19 (Epub 2025-04-20). DOI: https://doi.org/10.1021/acs.chemrestox.4c00347. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12093362/pdf/
Abstract
When applying machine learning (ML) approaches to predict bioactivity, it is common to collect data from different assays or sources and combine them into a single data set. However, depending on the data domains and sources from which these data are retrieved, bioactivity data for the same macromolecular target may show high variance in values (for a single compound) and cover very different parts of the chemical space and of the bioactivity range (for the data set as a whole). The effectiveness and applicability domain of the resulting prediction models may be strongly influenced by the sources of their training data. Therefore, we investigated the chemical space and active/inactive distribution of proprietary pharmaceutical data from Bayer AG and of the publicly available ChEMBL database, and their impact when used as training data for classification models. To this end, we applied two different sets of descriptors in combination with different ML algorithms. The results show substantial differences in chemical space between the two data sources, leading to suboptimal prediction performance when models are applied to domains other than that of their training data. Matthews correlation coefficient (MCC) values between -0.34 and 0.37 were obtained across all targets when models trained on Bayer AG data were tested on ChEMBL data and vice versa. For 31 targets, the mean Tanimoto similarity of the nearest neighbors between the two data sources was 0.3 or lower. Interestingly, none of the methods applied to assess the overlap in chemical space between the two data sources, with the aim of predicting the applicability of models beyond their training data sets, correlated with the observed performance. Finally, we applied different strategies for creating mixed training data sets from both public and proprietary sources, using assay format (cell-based and cell-free) information and Tanimoto similarities.
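The two quantities reported in the abstract, the mean nearest-neighbor Tanimoto similarity between two compound sets and the MCC of a binary classifier, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names are hypothetical, fingerprints are assumed to be plain sets of on-bit indices (the abstract does not specify the descriptors), and the toy data are invented purely for illustration.

```python
from math import sqrt


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def mean_nearest_neighbor_similarity(query_fps, reference_fps):
    """Average, over the query set, of each compound's highest similarity
    to any compound in the reference set."""
    best = [max(tanimoto(q, r) for r in reference_fps) for q in query_fps]
    return sum(best) / len(best)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = active, 0 = inactive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when the denominator vanishes.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy fingerprints (assumed, illustrative only): two "proprietary" and two "public" compounds.
proprietary = [{1, 2, 3, 4}, {2, 3, 5}]
public = [{1, 2, 3}, {7, 8, 9}]
nn_similarity = mean_nearest_neighbor_similarity(proprietary, public)

# Toy labels (assumed): one of two actives is missed by the classifier.
score = mcc([1, 1, 0, 0], [1, 0, 0, 0])
```

An MCC of 1 indicates perfect agreement, 0 is no better than chance, and -1 is complete disagreement, so the reported range of -0.34 to 0.37 sits close to chance level. With real molecules, the fingerprints would normally come from a cheminformatics toolkit rather than hand-written sets.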
About the journal:
Chemical Research in Toxicology publishes Articles, Rapid Reports, Chemical Profiles, Reviews, Perspectives, Letters to the Editor, and ToxWatch on a wide range of topics in Toxicology that inform a chemical and molecular understanding and capacity to predict biological outcomes on the basis of structures and processes. The overarching goal of the activities reported in the Journal is to provide the knowledge and innovative approaches needed to promote intelligent solutions for human safety and ecosystem preservation. The journal emphasizes insight into mechanisms of toxicity over phenomenological observations. It upholds rigorous chemical, physical, and mathematical standards for the characterization and application of modern techniques.