Rprot-Vec: a deep learning approach for fast protein structure similarity calculation.

IF 3.3; Zone 3 (Biology); JCR Q2 (BIOCHEMICAL RESEARCH METHODS)

BMC Bioinformatics Pub Date : 2025-07-10 DOI:10.1186/s12859-025-06213-1

Yichuan Zhang, Wen Zhang

BMC Bioinformatics, volume 26, issue 1, article 171. Publication type: Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12243341/pdf/

Citations: 0

Abstract

Background: Predicting protein structural similarity and detecting homologous sequences remain fundamental and challenging tasks in computational biology. Accurate identification of structural homologs enables function inference for newly discovered or unannotated proteins. Traditional approaches often require full 3D structural data, which is unavailable for most proteins. Thus, there is a need for sequence-based methods capable of inferring structural similarity efficiently and at scale.

Result: We present Rprot-Vec (Rapid Protein Vector), a deep learning model that predicts protein structural similarity and performs homology detection using only primary sequence data. The model integrates bidirectional GRU and multi-scale CNN layers with ProtT5-based encoding, enabling accurate and fast similarity estimation. Rprot-Vec achieves a 65.3% accurate similarity prediction rate in the homologous region (TM-score > 0.8), with an average prediction error of 0.0561 across all TM-score intervals. Despite having only 41% of the parameters of TM-vec, Rprot-Vec outperforms both public and locally trained TM-vec baselines in all tested settings. Additionally, we constructed and released three curated training datasets (CATH_TM_score_S/M/L), supporting further research in this area.
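
The two headline numbers in the Result paragraph (an average prediction error over all TM-score intervals, and an accuracy rate restricted to pairs with TM-score > 0.8) can be illustrated with a small metric sketch. This is not the authors' evaluation code: the abstract does not define "accurate", so the `TOLERANCE` value, the function names, and the sample scores below are all hypothetical, chosen only to show one plausible reading of the two metrics.

```python
from statistics import mean

HOMOLOGY_THRESHOLD = 0.8  # TM-score cutoff quoted in the abstract
TOLERANCE = 0.1           # hypothetical: largest error still counted as "accurate"


def mean_absolute_error(true_scores, predicted_scores):
    """Average |predicted - true| over all protein pairs."""
    return mean(abs(p - t) for t, p in zip(true_scores, predicted_scores))


def homologous_accuracy(true_scores, predicted_scores,
                        threshold=HOMOLOGY_THRESHOLD, tolerance=TOLERANCE):
    """Fraction of pairs with true TM-score > threshold whose prediction
    lies within `tolerance` of the true value (an assumed definition)."""
    errors = [abs(p - t) for t, p in zip(true_scores, predicted_scores)
              if t > threshold]
    if not errors:
        raise ValueError("no pairs above the homology threshold")
    return sum(e <= tolerance for e in errors) / len(errors)


# Made-up TM-scores for illustration only; not data from the paper.
true_tm = [0.95, 0.85, 0.90, 0.40, 0.20]
pred_tm = [0.90, 0.60, 0.88, 0.45, 0.25]

mae = mean_absolute_error(true_tm, pred_tm)
acc = homologous_accuracy(true_tm, pred_tm)
print(f"MAE = {mae:.3f}, homologous accuracy = {acc:.3f}")
```

Under this reading, the error metric uses every pair while the accuracy metric only looks at pairs whose true TM-score exceeds the threshold, which is why the two figures in the abstract are reported separately.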

Conclusion: Rprot-Vec offers a fast and lightweight solution for sequence-based structural similarity prediction. It can be applied in protein homology detection, structure-function inference, drug repurposing, and other downstream biological tasks. Its open-source availability and released datasets facilitate broader adoption and further development by the research community.

Source journal

BMC Bioinformatics (Biology: Biochemical Research Methods)

CiteScore: 5.70
Self-citation rate: 3.30%
Articles published: 506
Review time: 4.3 months

About the journal: BMC Bioinformatics is an open access, peer-reviewed journal that considers articles on all aspects of the development, testing and novel application of computational and statistical methods for the modeling and analysis of all kinds of biological data, as well as other areas of computational biology. BMC Bioinformatics is part of the BMC series which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.