Bioactivity Deep Learning for Complex Structure-Free Compound-Protein Interaction Prediction

IF 5.3 | Zone 2 (Chemistry) | JCR Q1: CHEMISTRY, MEDICINAL

Journal of Chemical Information and Modeling | Pub Date: 2025-09-16 | DOI: 10.1021/acs.jcim.5c00741

Yaowen Gu, Song Xia, Qi Ouyang, and Yingkai Zhang*

Volume 65, Issue 19, Pages 9910–9926

Citations: 0

Abstract

Protein–ligand binding affinity assessment plays a pivotal role in virtual drug screening, yet conventional data-driven approaches rely heavily on limited protein–ligand crystal structures. Structure-free compound-protein interaction (CPI) methods have emerged as competitive alternatives, leveraging extensive bioactivity data to serve as more robust scoring functions. However, these methods often overlook two critical challenges that affect data efficiency and modeling accuracy: the heterogeneity of bioactivity data due to differences in bioassay measurements, and the presence of activity cliffs (ACs), small chemical modifications that lead to significant changes in bioactivity, which have not been thoroughly investigated in CPI modeling. To address these challenges, we present CPI2M, a large-scale CPI benchmark data set containing approximately 2 million bioactivity data points across four activity types (Ki, Kd, EC50, and IC50) with AC annotations. Moreover, we developed GGAP-CPI, a complex structure-free deep learning model trained by integrated bioactivity learning and designed to mitigate the impact of ACs on CPI prediction through advanced protein representation modeling. Our comprehensive evaluation demonstrates that GGAP-CPI outperforms 12 target-specific and 7 general CPI baselines across 4 scenarios (general CPI prediction, rare protein prediction, transfer learning, and virtual screening) on 7 benchmarks (CPI2M, MoleculeACE, CASF-2016, MerckFEP, DUD-E, DEKOIS-v2, and LIT-PCBA). Furthermore, GGAP-CPI not only delivers stable bioactivity predictions but also measures prediction uncertainty and enriches binding pocket residues and interactions, underscoring its applicability to real-world bioactivity assessments and virtual drug screening.
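To make the notion of an activity cliff concrete, the following is a minimal illustrative sketch and not the annotation procedure used for CPI2M, which the abstract does not specify. It converts potencies given in nM to the usual negative log-molar scale and flags compound pairs that are highly similar yet differ strongly in potency. The feature sets, the 0.9 Tanimoto cutoff, the 10-fold potency cutoff, and all function and compound names are assumptions chosen for illustration only.

```python
import math
from itertools import combinations


def p_activity(value_nm: float) -> float:
    """Convert a potency in nM (e.g. Ki, Kd, EC50, IC50) to -log10(molar)."""
    return 9.0 - math.log10(value_nm)


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of structural features."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def find_activity_cliffs(records, sim_cutoff=0.9, fold_cutoff=10.0):
    """Return name pairs that are similar in structure but far apart in potency.

    records: list of (name, feature_set, value_nm), assumed to share one
    target and one activity type. Both cutoffs are illustrative assumptions.
    """
    min_gap = math.log10(fold_cutoff)  # potency gap in log units
    cliffs = []
    for (n1, f1, v1), (n2, f2, v2) in combinations(records, 2):
        similar = tanimoto(f1, f2) >= sim_cutoff
        gap = abs(p_activity(v1) - p_activity(v2))
        if similar and gap >= min_gap:
            cliffs.append((n1, n2))
    return cliffs


# Hypothetical toy data: integers stand in for fingerprint bits.
records = [
    ("cpd_A", frozenset(range(20)), 5.0),
    ("cpd_B", frozenset(range(19)) | {99}, 800.0),
    ("cpd_C", frozenset(range(10)) | frozenset(range(50, 60)), 6.0),
    ("cpd_D", frozenset(range(20)) | {100}, 7.0),
]

cliffs = find_activity_cliffs(records)
print(cliffs)
```

In this toy set only cpd_A and cpd_B form a cliff: they share most features but differ by more than two log units in potency. In practice the feature sets would come from a real molecular fingerprint, and pairs would be compared only within the same target and activity type, since values from different assay types are not directly comparable.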


Source journal

Journal of Chemical Information and Modeling (Chemistry: Chemistry, Multidisciplinary)

CiteScore: 9.80

Self-citation rate: 10.70%

Articles published: 529

Review time: 1.4 months

About the journal: The Journal of Chemical Information and Modeling publishes papers reporting new methodology and/or important applications in the fields of chemical informatics and molecular modeling. Specific topics include the representation and computer-based searching of chemical databases, molecular modeling, computer-aided molecular design of new materials, catalysts, or ligands, development of new computational methods or efficient algorithms for chemical software, and biopharmaceutical chemistry including analyses of biological activity and other issues related to drug discovery. Astute chemists, computer scientists, and information specialists look to this monthly’s insightful research studies, programming innovations, and software reviews to keep current with advances in this integral, multidisciplinary field. As a subscriber you’ll stay abreast of database search systems, use of graph theory in chemical problems, substructure search systems, pattern recognition and clustering, analysis of chemical and physical data, molecular modeling, graphics and natural language interfaces, bibliometric and citation analysis, and synthesis design and reactions databases.