Improved selection of canonical proteins for reference proteomes.

Impact factor: 4.0 · JCR Q1 (Genetics & Heredity)

NAR Genomics and Bioinformatics · Published: 2024-06-08 · eCollection date: 2024-06-01 · DOI: 10.1093/nargab/lqae066

Giuseppe Insana, Maria J Martin, William R Pearson

Volume 6, issue 2, article lqae066. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11165316/pdf/


Abstract

The 'canonical' protein sets distributed by UniProt are widely used for similarity searching, and functional and structural annotation. For many investigators, canonical sequences are the only version of a protein examined. However, higher eukaryotes often encode multiple isoforms of a protein from a single gene. For unreviewed (UniProtKB/TrEMBL) protein sequences, the longest sequence in a Gene-Centric group is chosen as canonical. This choice can create inconsistencies, selecting >95% identical orthologs with dramatically different lengths, which is biologically unlikely. We describe the ortho2tree pipeline, which examines Reference Proteome canonical and isoform sequences from sets of orthologous proteins, builds multiple alignments, constructs gap-distance trees, and identifies low-cost clades of isoforms with similar lengths. After examining 140 000 proteins from eight mammals in UniProtKB release 2022_05, ortho2tree proposed 7804 canonical changes for release 2023_01, while confirming 53 434 canonicals. Gap distributions for isoforms selected by ortho2tree are similar to those in bacterial and yeast alignments, organisms unaffected by isoform selection, suggesting ortho2tree canonicals more accurately reflect genuine biological variation. 82% of ortho2tree proposed-changes agreed with MANE; for confirmed canonicals, 92% agreed with MANE. Ortho2tree can improve canonical assignment among orthologous sequences that are >60% identical, a group that includes vertebrates and higher plants.
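The length inconsistency described in the abstract can be illustrated with a small sketch. This is not the ortho2tree implementation: the abstract says the pipeline builds gap-distance trees and looks for low-cost clades, whereas the toy below simply brute-forces one isoform per species and sums pairwise gap differences. The function names, the gap-distance definition (columns where exactly one sequence has a gap), and the three-species alignment are all assumptions made for illustration.

```python
from itertools import combinations, product


def gap_distance(a: str, b: str) -> int:
    """Count alignment columns where exactly one of the two rows has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum((x == "-") != (y == "-") for x, y in zip(a, b))


def longest_per_species(isoforms):
    """Mimic a 'longest sequence wins' rule: most non-gap residues per species."""
    return {
        species: max(rows, key=lambda name: len(rows[name].replace("-", "")))
        for species, rows in isoforms.items()
    }


def total_gap_cost(isoforms, pick):
    """Sum pairwise gap distances over the chosen isoform of each species."""
    rows = [isoforms[species][name] for species, name in pick.items()]
    return sum(gap_distance(a, b) for a, b in combinations(rows, 2))


def lowest_gap_cost_choice(isoforms):
    """Brute-force the one-isoform-per-species choice with the lowest gap cost."""
    species = sorted(isoforms)
    best_pick, best_cost = None, None
    for names in product(*(sorted(isoforms[s]) for s in species)):
        pick = dict(zip(species, names))
        cost = total_gap_cost(isoforms, pick)
        if best_cost is None or cost < best_cost:
            best_pick, best_cost = pick, cost
    return best_pick, best_cost


# Hypothetical toy alignment: rows are already aligned, '-' marks a gap.
ISOFORMS = {
    "human": {"h1": "MKTAYIAKQR------LLG", "h2": "MKTAYIAKQRGSGSGSLLG"},
    "mouse": {"m1": "MKTAYIAKQR------LLG"},
    "rat": {"r1": "MKTAYIAKQR------LLG", "r2": "MKTAYI----------LLG"},
}

if __name__ == "__main__":
    longest = longest_per_species(ISOFORMS)
    print("longest-rule pick:", longest, "cost:", total_gap_cost(ISOFORMS, longest))
    print("lowest-gap pick:", *lowest_gap_cost_choice(ISOFORMS))
```

In this toy data the longest human isoform carries a six-residue insertion that no ortholog shares, so the longest-sequence rule yields a set with a higher gap cost than the set of similar-length isoforms. Brute force is only workable for a handful of isoforms, and a tree-based clade search such as the one the abstract describes avoids enumerating every combination.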
