{"title":"CoPRA: Bridging Cross-domain Pretrained Sequence Models with Complex Structures for Protein-RNA Binding Affinity Prediction","authors":"Rong Han, Xiaohong Liu, Tong Pan, Jing Xu, Xiaoyu Wang, Wuyang Lan, Zhenyu Li, Zixuan Wang, Jiangning Song, Guangyu Wang, Ting Chen","doi":"arxiv-2409.03773","DOIUrl":null,"url":null,"abstract":"Accurately measuring protein-RNA binding affinity is crucial in many\nbiological processes and drug design. Previous computational methods for\nprotein-RNA binding affinity prediction rely on either sequence or structure\nfeatures, unable to capture the binding mechanisms comprehensively. The recent\nemerging pre-trained language models trained on massive unsupervised sequences\nof protein and RNA have shown strong representation ability for various\nin-domain downstream tasks, including binding site prediction. However,\napplying different-domain language models collaboratively for complex-level\ntasks remains unexplored. In this paper, we propose CoPRA to bridge pre-trained\nlanguage models from different biological domains via Complex structure for\nProtein-RNA binding Affinity prediction. We demonstrate for the first time that\ncross-biological modal language models can collaborate to improve binding\naffinity prediction. We propose a Co-Former to combine the cross-modal sequence\nand structure information and a bi-scope pre-training strategy for improving\nCo-Former's interaction understanding. Meanwhile, we build the largest\nprotein-RNA binding affinity dataset PRA310 for performance evaluation. We also\ntest our model on a public dataset for mutation effect prediction. CoPRA\nreaches state-of-the-art performance on all the datasets. We provide extensive\nanalyses and verify that CoPRA can (1) accurately predict the protein-RNA\nbinding affinity; (2) understand the binding affinity change caused by\nmutations; and (3) benefit from scaling data and model size.","PeriodicalId":501022,"journal":{"name":"arXiv - QuanBio - Biomolecules","volume":"53 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - QuanBio - Biomolecules","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.03773","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Accurately measuring protein-RNA binding affinity is crucial for many biological processes and for drug design. Previous computational methods for protein-RNA binding affinity prediction rely on either sequence or structure features alone and therefore cannot capture binding mechanisms comprehensively. Recently emerging pre-trained language models, trained on massive unlabeled protein and RNA sequences, have shown strong representation ability on various in-domain downstream tasks, including binding site prediction. However, applying language models from different domains collaboratively to complex-level tasks remains unexplored. In this paper, we propose CoPRA, which bridges pre-trained language models from different biological domains via Complex structure for Protein-RNA binding Affinity prediction. We demonstrate, for the first time, that language models from different biological modalities can collaborate to improve binding affinity prediction. We propose Co-Former, which combines cross-modal sequence and structure information, together with a bi-scope pre-training strategy that improves Co-Former's understanding of interactions. We also build PRA310, the largest protein-RNA binding affinity dataset, for performance evaluation, and we further test our model on a public dataset for mutation effect prediction. CoPRA reaches state-of-the-art performance on all datasets. We provide extensive analyses and verify that CoPRA can (1) accurately predict protein-RNA binding affinity; (2) understand binding affinity changes caused by mutations; and (3) benefit from scaling data and model size.
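To make the cross-modal fusion idea concrete, below is a minimal, illustrative PyTorch sketch of how per-residue embeddings from a protein language model and an RNA language model could be projected into a shared space, fused with cross-attention, and pooled into a single affinity score. This is an assumption-laden sketch, not the paper's actual Co-Former: the module names, hidden sizes (e.g., 1280 for an ESM-2-style protein LM, 640 for an RNA LM), and mean pooling are hypothetical choices, and the sketch omits the complex-structure features and bi-scope pre-training that CoPRA itself relies on.

```python
# Illustrative sketch only; names and dimensions are assumptions, not CoPRA's code.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        # Project each modality's LM embeddings into a shared space.
        self.proj_prot = nn.Linear(1280, d_model)  # assumed protein-LM hidden size
        self.proj_rna = nn.Linear(640, d_model)    # assumed RNA-LM hidden size
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, 1))

    def forward(self, prot_emb: torch.Tensor, rna_emb: torch.Tensor) -> torch.Tensor:
        # prot_emb: (B, Lp, 1280), rna_emb: (B, Lr, 640), e.g. from frozen LMs.
        p = self.proj_prot(prot_emb)
        r = self.proj_rna(rna_emb)
        # Protein tokens attend to RNA tokens; mean-pool for a complex-level score.
        fused, _ = self.cross_attn(query=p, key=r, value=r)
        return self.head(fused.mean(dim=1)).squeeze(-1)  # predicted affinity per complex

# Toy usage with random tensors standing in for language-model outputs.
model = CrossModalFusion()
affinity = model(torch.randn(2, 100, 1280), torch.randn(2, 60, 640))
print(affinity.shape)  # torch.Size([2])
```

The design point the sketch tries to capture is that the two pre-trained encoders stay separate and domain-specific; only a lightweight fusion module learns the interaction, which is what makes collaboration between cross-domain language models feasible for complex-level tasks.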