SPACE: STRING proteins as complementary embeddings.


Bioinformatics (Oxford, England). Pub Date: 2025-09-01. DOI: 10.1093/bioinformatics/btaf496

Dewei Hu, Damian Szklarczyk, Christian von Mering, Lars Juhl Jensen



Abstract

Motivation: Representation learning has revolutionized sequence-based prediction of protein function and subcellular localization. Protein networks are an important source of information complementary to sequences, but the use of protein networks has proven to be challenging in the context of machine learning, especially in a cross-species setting.

Results: We leveraged the STRING database of protein networks and orthology relations for 1322 eukaryotes to generate network-based cross-species protein embeddings. We did this by first creating species-specific network embeddings and subsequently aligning them based on orthology relations to facilitate direct cross-species comparisons. We show that these aligned network embeddings ensure consistency across species without sacrificing quality compared to species-specific network embeddings. We also show that the aligned network embeddings are complementary to sequence embedding techniques, despite the use of sequence-based orthology relations in the alignment process. Finally, we validated the embeddings by using them for two well-established tasks: subcellular localization prediction and protein function prediction. Training logistic regression classifiers on aligned network embeddings and sequence embeddings improved the accuracy over using sequence alone, reaching performance numbers close to state-of-the-art deep-learning methods.
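The alignment idea in the Results can be illustrated with a toy example. The abstract does not state which alignment algorithm is used, so the sketch below swaps in a deliberately simple stand-in: a least-squares rotation (the orthogonal Procrustes problem restricted to 2D rotations) fitted on ortholog pairs. All protein names, coordinates, and function names are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical 2D "network embeddings" for two species. Each species was
# embedded independently, so the coordinate systems are not comparable.
species_a = {"a1": (1.0, 0.0), "a2": (0.0, 2.0), "a3": (-1.0, -1.0), "a4": (2.0, 1.0)}
species_b = {"b1": (0.0, 1.0), "b2": (-2.0, 0.0), "b3": (1.0, -1.0), "b4": (-1.0, 2.0)}

# Assumed ortholog pairs used to fit the alignment; (a4, b4) is held out.
orthologs = [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]


def fit_rotation(source, target):
    """Return the angle of the 2D rotation that best maps source onto target
    in the least-squares sense (closed-form 2D Procrustes rotation)."""
    cross = sum(sx * ty - sy * tx for (sx, sy), (tx, ty) in zip(source, target))
    dot = sum(sx * tx + sy * ty for (sx, sy), (tx, ty) in zip(source, target))
    return math.atan2(cross, dot)


def rotate(point, angle):
    """Rotate a 2D point counter-clockwise by the given angle."""
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)


def nearest(point, embeddings):
    """Name of the protein whose embedding is closest (Euclidean) to point."""
    return min(embeddings, key=lambda name: math.dist(point, embeddings[name]))


# Fit the rotation that maps species A's space onto species B's space.
angle = fit_rotation(
    [species_a[a] for a, _ in orthologs],
    [species_b[b] for _, b in orthologs],
)

# Compare the held-out protein a4 against species B before and after alignment.
neighbor_before = nearest(species_a["a4"], species_b)
neighbor_after = nearest(rotate(species_a["a4"], angle), species_b)
```

In this toy setup, the unaligned nearest neighbor of `a4` in species B is an unrelated protein, while after alignment it is the held-out ortholog `b4`. This is the sense in which aligned embeddings allow direct cross-species comparison. Real embeddings are high-dimensional, and the method in the SPACE repository may differ substantially from this rotation-only stand-in.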

Availability and implementation: The source code and scripts for generating the network-based cross-species protein embeddings are available at https://github.com/deweihu96/SPACE. Precomputed network embeddings and sequence embeddings for all eukaryotic proteins are included in STRING version 12.0 (https://string-db.org/cgi/download).


Supplementary information: Supplementary data are available at Bioinformatics online.

