{"title":"CoPRA: Bridging Cross-domain Pretrained Sequence Models with Complex Structures for Protein-RNA Binding Affinity Prediction","authors":"Rong Han, Xiaohong Liu, Tong Pan, Jing Xu, Xiaoyu Wang, Wuyang Lan, Zhenyu Li, Zixuan Wang, Jiangning Song, Guangyu Wang, Ting Chen","doi":"arxiv-2409.03773","DOIUrl":null,"url":null,"abstract":"Accurately measuring protein-RNA binding affinity is crucial in many\nbiological processes and drug design. Previous computational methods for\nprotein-RNA binding affinity prediction rely on either sequence or structure\nfeatures, unable to capture the binding mechanisms comprehensively. The recent\nemerging pre-trained language models trained on massive unsupervised sequences\nof protein and RNA have shown strong representation ability for various\nin-domain downstream tasks, including binding site prediction. However,\napplying different-domain language models collaboratively for complex-level\ntasks remains unexplored. In this paper, we propose CoPRA to bridge pre-trained\nlanguage models from different biological domains via Complex structure for\nProtein-RNA binding Affinity prediction. We demonstrate for the first time that\ncross-biological modal language models can collaborate to improve binding\naffinity prediction. We propose a Co-Former to combine the cross-modal sequence\nand structure information and a bi-scope pre-training strategy for improving\nCo-Former's interaction understanding. Meanwhile, we build the largest\nprotein-RNA binding affinity dataset PRA310 for performance evaluation. We also\ntest our model on a public dataset for mutation effect prediction. CoPRA\nreaches state-of-the-art performance on all the datasets. We provide extensive\nanalyses and verify that CoPRA can (1) accurately predict the protein-RNA\nbinding affinity; (2) understand the binding affinity change caused by\nmutations; and (3) benefit from scaling data and model size.","PeriodicalId":501022,"journal":{"name":"arXiv - QuanBio - Biomolecules","volume":"53 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - QuanBio - Biomolecules","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.03773","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Accurately measuring protein-RNA binding affinity is crucial for many biological processes and for drug design. Previous computational methods for protein-RNA binding affinity prediction rely on either sequence or structure features alone and therefore cannot capture binding mechanisms comprehensively. Recently emerging pre-trained language models, trained on massive unlabeled protein and RNA sequences, have shown strong representation ability on various in-domain downstream tasks, including binding site prediction. However, applying language models from different domains collaboratively to complex-level tasks remains unexplored. In this paper, we propose CoPRA, which bridges pre-trained language models from different biological domains via Complex structure for Protein-RNA binding Affinity prediction. We demonstrate, for the first time, that language models from different biological modalities can collaborate to improve binding affinity prediction. We propose Co-Former, which combines cross-modal sequence and structure information, together with a bi-scope pre-training strategy that improves Co-Former's understanding of interactions. We also build PRA310, the largest protein-RNA binding affinity dataset, for performance evaluation, and we further test our model on a public dataset for mutation effect prediction. CoPRA reaches state-of-the-art performance on all datasets. We provide extensive analyses and verify that CoPRA can (1) accurately predict protein-RNA binding affinity; (2) understand binding affinity changes caused by mutations; and (3) benefit from scaling data and model size.
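To make the cross-modal fusion idea concrete, below is a minimal, illustrative PyTorch sketch of how per-residue embeddings from a protein language model and an RNA language model could be projected into a shared space, fused with cross-attention, and pooled into a single affinity score. This is an assumption-laden sketch, not the paper's actual Co-Former: the module names, hidden sizes (e.g., 1280 for an ESM-2-style protein LM, 640 for an RNA LM), and mean pooling are hypothetical choices, and the sketch omits the complex-structure features and bi-scope pre-training that CoPRA itself relies on.

```python
# Illustrative sketch only; names and dimensions are assumptions, not CoPRA's code.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        # Project each modality's LM embeddings into a shared space.
        self.proj_prot = nn.Linear(1280, d_model)  # assumed protein-LM hidden size
        self.proj_rna = nn.Linear(640, d_model)    # assumed RNA-LM hidden size
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, 1))

    def forward(self, prot_emb: torch.Tensor, rna_emb: torch.Tensor) -> torch.Tensor:
        # prot_emb: (B, Lp, 1280), rna_emb: (B, Lr, 640), e.g. from frozen LMs.
        p = self.proj_prot(prot_emb)
        r = self.proj_rna(rna_emb)
        # Protein tokens attend to RNA tokens; mean-pool for a complex-level score.
        fused, _ = self.cross_attn(query=p, key=r, value=r)
        return self.head(fused.mean(dim=1)).squeeze(-1)  # predicted affinity per complex

# Toy usage with random tensors standing in for language-model outputs.
model = CrossModalFusion()
affinity = model(torch.randn(2, 100, 1280), torch.randn(2, 60, 640))
print(affinity.shape)  # torch.Size([2])
```

The design point the sketch tries to capture is that the two pre-trained encoders stay separate and domain-specific; only a lightweight fusion module learns the interaction, which is what makes collaboration between cross-domain language models feasible for complex-level tasks.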