{"title":"CoPRA:将跨域预训练序列模型与复杂结构衔接起来,用于蛋白质-RNA 结合亲和力预测","authors":"Rong Han, Xiaohong Liu, Tong Pan, Jing Xu, Xiaoyu Wang, Wuyang Lan, Zhenyu Li, Zixuan Wang, Jiangning Song, Guangyu Wang, Ting Chen","doi":"arxiv-2409.03773","DOIUrl":null,"url":null,"abstract":"Accurately measuring protein-RNA binding affinity is crucial in many\nbiological processes and drug design. Previous computational methods for\nprotein-RNA binding affinity prediction rely on either sequence or structure\nfeatures, unable to capture the binding mechanisms comprehensively. The recent\nemerging pre-trained language models trained on massive unsupervised sequences\nof protein and RNA have shown strong representation ability for various\nin-domain downstream tasks, including binding site prediction. However,\napplying different-domain language models collaboratively for complex-level\ntasks remains unexplored. In this paper, we propose CoPRA to bridge pre-trained\nlanguage models from different biological domains via Complex structure for\nProtein-RNA binding Affinity prediction. We demonstrate for the first time that\ncross-biological modal language models can collaborate to improve binding\naffinity prediction. We propose a Co-Former to combine the cross-modal sequence\nand structure information and a bi-scope pre-training strategy for improving\nCo-Former's interaction understanding. Meanwhile, we build the largest\nprotein-RNA binding affinity dataset PRA310 for performance evaluation. We also\ntest our model on a public dataset for mutation effect prediction. CoPRA\nreaches state-of-the-art performance on all the datasets. We provide extensive\nanalyses and verify that CoPRA can (1) accurately predict the protein-RNA\nbinding affinity; (2) understand the binding affinity change caused by\nmutations; and (3) benefit from scaling data and model size.","PeriodicalId":501022,"journal":{"name":"arXiv - QuanBio - Biomolecules","volume":"53 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"CoPRA: Bridging Cross-domain Pretrained Sequence Models with Complex Structures for Protein-RNA Binding Affinity Prediction\",\"authors\":\"Rong Han, Xiaohong Liu, Tong Pan, Jing Xu, Xiaoyu Wang, Wuyang Lan, Zhenyu Li, Zixuan Wang, Jiangning Song, Guangyu Wang, Ting Chen\",\"doi\":\"arxiv-2409.03773\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Accurately measuring protein-RNA binding affinity is crucial in many\\nbiological processes and drug design. Previous computational methods for\\nprotein-RNA binding affinity prediction rely on either sequence or structure\\nfeatures, unable to capture the binding mechanisms comprehensively. The recent\\nemerging pre-trained language models trained on massive unsupervised sequences\\nof protein and RNA have shown strong representation ability for various\\nin-domain downstream tasks, including binding site prediction. However,\\napplying different-domain language models collaboratively for complex-level\\ntasks remains unexplored. In this paper, we propose CoPRA to bridge pre-trained\\nlanguage models from different biological domains via Complex structure for\\nProtein-RNA binding Affinity prediction. We demonstrate for the first time that\\ncross-biological modal language models can collaborate to improve binding\\naffinity prediction. 
We propose a Co-Former to combine the cross-modal sequence\\nand structure information and a bi-scope pre-training strategy for improving\\nCo-Former's interaction understanding. Meanwhile, we build the largest\\nprotein-RNA binding affinity dataset PRA310 for performance evaluation. We also\\ntest our model on a public dataset for mutation effect prediction. CoPRA\\nreaches state-of-the-art performance on all the datasets. We provide extensive\\nanalyses and verify that CoPRA can (1) accurately predict the protein-RNA\\nbinding affinity; (2) understand the binding affinity change caused by\\nmutations; and (3) benefit from scaling data and model size.\",\"PeriodicalId\":501022,\"journal\":{\"name\":\"arXiv - QuanBio - Biomolecules\",\"volume\":\"53 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-08-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - QuanBio - Biomolecules\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.03773\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - QuanBio - Biomolecules","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.03773","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
CoPRA: Bridging Cross-domain Pretrained Sequence Models with Complex Structures for Protein-RNA Binding Affinity Prediction
Accurately measuring protein-RNA binding affinity is crucial in many biological processes and in drug design. Previous computational methods for protein-RNA binding affinity prediction rely on either sequence or structure features and are therefore unable to capture the binding mechanisms comprehensively. Recently emerging language models, pre-trained on massive unsupervised protein and RNA sequences, have shown strong representation ability for various in-domain downstream tasks, including binding site prediction. However, applying language models from different domains collaboratively to complex-level tasks remains unexplored. In this paper, we propose CoPRA, which bridges pre-trained language models from different biological domains via Complex structure for Protein-RNA binding Affinity prediction. We demonstrate for the first time that language models from different biological modalities can collaborate to improve binding affinity prediction. We propose a Co-Former to combine cross-modal sequence and structure information, together with a bi-scope pre-training strategy that improves Co-Former's understanding of interactions. Meanwhile, we build PRA310, the largest protein-RNA binding affinity dataset, for performance evaluation, and we also test our model on a public dataset for mutation effect prediction. CoPRA achieves state-of-the-art performance on all of these datasets. We provide extensive analyses and verify that CoPRA can (1) accurately predict protein-RNA binding affinity; (2) understand binding affinity changes caused by mutations; and (3) benefit from scaling data and model size.
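
To make the cross-modal fusion idea concrete, below is a minimal, hypothetical PyTorch sketch of a Co-Former-style module: per-token embeddings from a protein language model and an RNA language model are projected into a shared space, combined with a simple structural feature derived from the complex, and pooled into a single affinity score. All module names, dimensions, and the fusion scheme are illustrative assumptions for exposition, not CoPRA's actual architecture.

```python
# Hypothetical sketch of cross-modal sequence + structure fusion for
# protein-RNA binding affinity regression. Dimensions and fusion scheme
# are assumptions, not the paper's actual Co-Former design.
import torch
import torch.nn as nn

class CoFormerSketch(nn.Module):
    def __init__(self, prot_dim=1280, rna_dim=640, d_model=256,
                 n_layers=4, n_heads=8):
        super().__init__()
        # Project each modality's language-model embeddings into a shared space.
        self.prot_proj = nn.Linear(prot_dim, d_model)
        self.rna_proj = nn.Linear(rna_dim, d_model)
        # In this toy version, a per-token structural scalar (e.g., distance
        # to the binding interface) enters as an additive embedding bias.
        self.struct_proj = nn.Linear(1, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.affinity_head = nn.Linear(d_model, 1)

    def forward(self, prot_emb, rna_emb, struct_feat):
        # prot_emb:    (B, Lp, prot_dim) from a protein LM such as ESM-2
        # rna_emb:     (B, Lr, rna_dim)  from an RNA LM
        # struct_feat: (B, Lp + Lr, 1)   assumed per-token structural feature
        tokens = torch.cat([self.prot_proj(prot_emb),
                            self.rna_proj(rna_emb)], dim=1)
        tokens = tokens + self.struct_proj(struct_feat)
        hidden = self.encoder(tokens)
        # Mean-pool over the joint protein+RNA sequence and regress affinity.
        return self.affinity_head(hidden.mean(dim=1)).squeeze(-1)

model = CoFormerSketch()
affinity = model(torch.randn(2, 100, 1280),  # protein LM embeddings
                 torch.randn(2, 40, 640),    # RNA LM embeddings
                 torch.randn(2, 140, 1))     # structural features
print(affinity.shape)  # torch.Size([2])
```

The design choice worth noting is that both frozen language models contribute only embeddings; the trainable fusion module sits on top, which is one plausible way to let cross-domain pre-trained models collaborate without retraining either of them.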