BASE: a web service for providing compound-protein binding affinity prediction datasets with reduced similarity bias
Hyojin Son, Sechan Lee, Jaeuk Kim, Haangik Park, Myeong-Ha Hwang, Gwan-Su Yi
BMC Bioinformatics 25(1):340, published 2024-10-30. DOI: 10.1186/s12859-024-05968-3 (https://doi.org/10.1186/s12859-024-05968-3). Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11526688/pdf/
Abstract
Background: Deep learning-based drug-target affinity (DTA) prediction methods have shown impressive performance, despite a high number of training parameters relative to the available data. Previous studies have highlighted the presence of dataset bias by suggesting that models trained solely on protein or ligand structures may perform similarly to those trained on complex structures. However, these studies did not propose solutions and focused solely on analyzing complex structure-based models. Even when ligands are excluded, protein-only models trained on complex structures still incorporate some ligand information at the binding sites. Therefore, it is unclear whether binding affinity can be accurately predicted using only compound or protein features due to potential dataset bias. In this study, we expanded our analysis to comprehensive databases and investigated dataset bias through compound and protein feature-based methods using multilayer perceptron models. We assessed the impact of this bias on current prediction models and proposed the binding affinity similarity explorer (BASE) web service, which provides bias-reduced datasets.
Results: By analyzing eight binding affinity databases using multilayer perceptron models, we confirmed a bias where the compound-protein binding affinity can be accurately predicted using compound features alone. This bias arises because most compounds show consistent binding affinities due to high sequence or functional similarity among their target proteins. Our Uniform Manifold Approximation and Projection analysis based on compound fingerprints further revealed that low and high variation compounds do not exhibit significant structural differences. This suggests that the primary factor driving the consistent binding affinities is protein similarity rather than compound structure. We addressed this bias by creating datasets with progressively reduced protein similarity between the training and test sets, observing significant changes in model performance. We developed the BASE web service to allow researchers to download and utilize these datasets. Feature importance analysis revealed that previous models heavily relied on protein features. However, using bias-reduced datasets increased the importance of compound and interaction features, enabling a more balanced extraction of key features.
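The compound-only bias described in the Results can be illustrated with a toy example. This is a minimal sketch, not the authors' multilayer perceptron: it swaps in a much simpler per-compound mean predictor, and the record layout, the identifiers, and the 0.5 cutoff for "low variation" are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def per_compound_stats(records):
    """Group affinities by compound and return {compound: (mean, std)}.

    records: iterable of (compound_id, protein_id, affinity) tuples
    (a hypothetical layout, not the BASE file format).
    """
    by_compound = defaultdict(list)
    for compound_id, _protein_id, affinity in records:
        by_compound[compound_id].append(affinity)
    return {c: (mean(v), pstdev(v)) for c, v in by_compound.items()}


def compound_only_predict(stats, compound_id, fallback):
    """Predict affinity from the compound alone, ignoring the protein."""
    return stats[compound_id][0] if compound_id in stats else fallback


# Toy training data: c1 was measured against three similar targets,
# c2 against two unrelated ones.
train = [
    ("c1", "kinaseA", 7.0),
    ("c1", "kinaseB", 7.2),
    ("c1", "kinaseC", 6.8),
    ("c2", "kinaseA", 5.0),
    ("c2", "proteaseX", 9.0),
]

stats = per_compound_stats(train)

# Hypothetical cutoff: a compound counts as "low variation" if the
# standard deviation of its affinities is below 0.5.
low_variation = sorted(c for c, (_, sd) in stats.items() if sd < 0.5)
print(low_variation)
print(compound_only_predict(stats, "c1", fallback=6.0))
```

When most compounds behave like `c1`, a predictor that never looks at the protein can still score well on a randomly split test set, which is the kind of bias the abstract reports.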
Conclusions: We propose the BASE web service, providing both the affinity prediction results of existing models and bias-reduced datasets. These resources contribute to the development of generalized and robust predictive models, enhancing the accuracy and reliability of DTA predictions in the drug discovery process. BASE is freely available online at https://synbi2024.kaist.ac.kr/base .
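One way to picture a dataset with reduced protein similarity between training and test sets is a threshold filter on the training proteins. The sketch below is an assumption-laden illustration, not the procedure used to build the BASE datasets: `difflib.SequenceMatcher` is a crude stand-in for a real sequence-similarity measure, and the function names, record layout, sequences, and thresholds are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(seq_a: str, seq_b: str) -> float:
    """Crude string similarity in [0, 1]; a stand-in, not an alignment score."""
    return SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()


def split_by_protein_similarity(records, test_proteins, max_similarity):
    """Split records so no training protein is too close to a test protein.

    records: list of (compound_id, protein_sequence, affinity) tuples.
    A record goes to the test set if its protein is a test protein; it is
    kept for training only if its protein's similarity to every test
    protein is at most max_similarity. Other records are dropped.
    """
    test_set = set(test_proteins)
    test = [r for r in records if r[1] in test_set]
    train = [
        r for r in records
        if r[1] not in test_set
        and all(similarity(r[1], t) <= max_similarity for t in test_set)
    ]
    return train, test


records = [
    ("c1", "MKTAYIAKQR", 7.2),
    ("c2", "MKTAYIAKQK", 7.1),  # near-duplicate of the test protein
    ("c3", "GSHMLEDPVA", 5.0),  # unrelated sequence
    ("c4", "MKTAYIAKQR", 6.9),
]
test_proteins = ["MKTAYIAKQR"]

# Lowering the threshold progressively removes similar proteins from training.
sizes = {}
for threshold in (1.0, 0.5):
    train, test = split_by_protein_similarity(records, test_proteins, threshold)
    sizes[threshold] = (len(train), len(test))
print(sizes)
```

Evaluating the same model at each threshold shows how much of its apparent accuracy depends on having seen near-identical proteins during training.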
About the journal:
BMC Bioinformatics is an open access, peer-reviewed journal that considers articles on all aspects of the development, testing and novel application of computational and statistical methods for the modeling and analysis of all kinds of biological data, as well as other areas of computational biology.
BMC Bioinformatics is part of the BMC series which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.