Bioactivity Deep Learning for Complex Structure-Free Compound-Protein Interaction Prediction

Impact factor 5.3; Zone 2 (Chemistry); JCR Q1 (Chemistry, Medicinal)
Yaowen Gu, Song Xia, Qi Ouyang, and Yingkai Zhang*
Journal of Chemical Information and Modeling, 65 (19), 9910–9926. Published 2025-09-16. DOI: 10.1021/acs.jcim.5c00741
https://pubs.acs.org/doi/10.1021/acs.jcim.5c00741
Citation count: 0

Abstract

Protein–ligand binding affinity assessment plays a pivotal role in virtual drug screening, yet conventional data-driven approaches rely heavily on the limited supply of protein–ligand crystal structures. Structure-free compound-protein interaction (CPI) methods have emerged as competitive alternatives, leveraging extensive bioactivity data to serve as more robust scoring functions. However, these methods often overlook two critical challenges that affect data efficiency and modeling accuracy: the heterogeneity of bioactivity data due to differences in bioassay measurements, and the presence of activity cliffs (ACs), in which small chemical modifications lead to large changes in bioactivity. Neither has been thoroughly investigated in CPI modeling. To address these challenges, we present CPI2M, a large-scale CPI benchmark data set containing approximately 2 million bioactivity data points across four activity types (Ki, Kd, EC50, and IC50) with AC annotations. Moreover, we developed GGAP-CPI, a complex structure-free deep learning model trained by integrated bioactivity learning and designed to mitigate the impact of ACs on CPI prediction through advanced protein representation modeling. Our comprehensive evaluation demonstrates that GGAP-CPI outperforms 12 target-specific and 7 general CPI baselines across 4 scenarios (general CPI prediction, rare protein prediction, transfer learning, and virtual screening) on 7 benchmarks (CPI2M, MoleculeACE, CASF-2016, MerckFEP, DUD-E, DEKOIS-v2, and LIT-PCBA). Furthermore, GGAP-CPI not only delivers stable bioactivity predictions but also measures prediction uncertainty and enriches binding pocket residues and interactions, underscoring its applicability to real-world bioactivity assessments and virtual drug screening.
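
To make the notion of an activity cliff concrete, the following is a minimal sketch of how AC pairs might be flagged within a single target and a single activity type. It is not the CPI2M annotation protocol: the similarity threshold (0.9), the potency threshold (10-fold), the use of plain feature sets in place of real molecular fingerprints, and all function and compound names are assumptions made for illustration.

```python
import math
from itertools import combinations


def p_activity(value_nm: float) -> float:
    """Convert an activity in nM (Ki, Kd, EC50, or IC50) to -log10(molar)."""
    return 9.0 - math.log10(value_nm)


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity between two feature sets (stand-ins for fingerprints)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_activity_cliffs(compounds, sim_threshold=0.9, fold_threshold=10.0):
    """Return pairs that are structurally similar but differ strongly in potency.

    compounds: list of (name, feature_set, activity_nM) tuples. All entries are
    assumed to share one protein target and one activity type, because values
    from different assay types are not directly comparable.
    Both thresholds are illustrative assumptions.
    """
    min_delta = math.log10(fold_threshold)  # 10-fold == 1 log unit
    cliffs = []
    for (n1, f1, v1), (n2, f2, v2) in combinations(compounds, 2):
        sim = tanimoto(f1, f2)
        delta = abs(p_activity(v1) - p_activity(v2))
        if sim >= sim_threshold and delta >= min_delta:
            cliffs.append((n1, n2, round(sim, 3), round(delta, 3)))
    return cliffs


# Hypothetical toy data: integer IDs play the role of fingerprint bits.
example = [
    ("cpd_a", frozenset(range(20)), 5.0),             # 5 nM
    ("cpd_b", frozenset(range(19)) | {99}, 800.0),    # one feature swapped, 800 nM
    ("cpd_c", frozenset(range(10, 30)), 6.0),         # dissimilar scaffold
]

print(find_activity_cliffs(example))
# → [('cpd_a', 'cpd_b', 0.905, 2.204)]
```

Only the pair cpd_a/cpd_b is flagged: the two differ by a single feature yet are separated by more than two log units of potency, whereas cpd_c is too dissimilar to either to count as a cliff partner. Restricting the comparison to one activity type per target is one simple way to keep assay heterogeneity out of the cliff definition.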

Source journal
CiteScore: 9.80
Self-citation rate: 10.70%
Articles published: 529
Review time: 1.4 months
About the journal: The Journal of Chemical Information and Modeling publishes papers reporting new methodology and/or important applications in the fields of chemical informatics and molecular modeling. Specific topics include the representation and computer-based searching of chemical databases, molecular modeling, computer-aided molecular design of new materials, catalysts, or ligands, development of new computational methods or efficient algorithms for chemical software, and biopharmaceutical chemistry including analyses of biological activity and other issues related to drug discovery. Astute chemists, computer scientists, and information specialists look to this monthly’s insightful research studies, programming innovations, and software reviews to keep current with advances in this integral, multidisciplinary field. As a subscriber you’ll stay abreast of database search systems, use of graph theory in chemical problems, substructure search systems, pattern recognition and clustering, analysis of chemical and physical data, molecular modeling, graphics and natural language interfaces, bibliometric and citation analysis, and synthesis design and reactions databases.