CoPRA: Bridging Cross-domain Pretrained Sequence Models with Complex Structures for Protein-RNA Binding Affinity Prediction

Rong Han, Xiaohong Liu, Tong Pan, Jing Xu, Xiaoyu Wang, Wuyang Lan, Zhenyu Li, Zixuan Wang, Jiangning Song, Guangyu Wang, Ting Chen
arXiv:2409.03773 · arXiv - QuanBio - Biomolecules · Published 2024-08-21

Abstract

Accurately measuring protein-RNA binding affinity is crucial in many biological processes and in drug design. Previous computational methods for protein-RNA binding affinity prediction rely on either sequence or structure features alone and therefore cannot capture the binding mechanisms comprehensively. Recently emerging language models, pre-trained on massive unlabeled protein and RNA sequences, have shown strong representation ability on various in-domain downstream tasks, including binding site prediction. However, applying language models from different domains collaboratively to complex-level tasks remains unexplored. In this paper, we propose CoPRA, which bridges pre-trained language models from different biological domains via Complex structure for Protein-RNA binding Affinity prediction. We demonstrate for the first time that language models from different biological modalities can collaborate to improve binding affinity prediction. We propose Co-Former to combine cross-modal sequence and structure information, together with a bi-scope pre-training strategy that improves Co-Former's understanding of interactions. We also build PRA310, the largest protein-RNA binding affinity dataset to date, for performance evaluation, and additionally test our model on a public dataset for mutation effect prediction. CoPRA reaches state-of-the-art performance on all of these datasets. Extensive analyses verify that CoPRA can (1) accurately predict protein-RNA binding affinity; (2) capture the binding affinity change caused by mutations; and (3) benefit from scaling data and model size.
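The abstract gives no implementation details, so as a rough, hypothetical illustration of the bridging idea only (not the authors' actual Co-Former), the sketch below fuses per-token embeddings from two separate sequence encoders with a single cross-attention step and pools them into one complex-level affinity score. All dimensions, array names, and the untrained regression head are invented for illustration; random vectors stand in for the outputs of a protein language model and an RNA language model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_tokens, kv_tokens):
    # Scaled dot-product attention: protein tokens attend to RNA tokens,
    # mixing information across the two modalities.
    d = q_tokens.shape[-1]
    scores = q_tokens @ kv_tokens.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ kv_tokens

rng = np.random.default_rng(0)
d_model = 64
prot_emb = rng.normal(size=(120, d_model))  # stand-in for protein-LM output (120 residues)
rna_emb = rng.normal(size=(40, d_model))    # stand-in for RNA-LM output (40 nucleotides)

# Fuse the two modalities, then mean-pool to one complex-level vector.
fused = cross_attention(prot_emb, rna_emb)       # shape (120, d_model)
complex_vec = fused.mean(axis=0)                 # shape (d_model,)

# Untrained linear head mapping the pooled vector to a scalar affinity.
w = rng.normal(size=d_model) / np.sqrt(d_model)
affinity = float(complex_vec @ w)
print(fused.shape, affinity)
```

In a real system this step would sit inside a trained network, and the paper's Co-Former additionally injects complex-structure information; the sketch only shows how two independently pretrained sequence encoders can be coupled at the token level.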