An objective criterion to evaluate sequence-similarity networks helps in dividing the protein family sequence space.

Impact factor 4.3 · Zone 2 (Biology)

PLoS Computational Biology Pub Date : 2023-08-16 eCollection Date: 2023-08-01 DOI:10.1371/journal.pcbi.1010881

Bastian Volker Helmut Hornung, Nicolas Terrapon


Citations: 2

Abstract

The deluge of genomic data raises various challenges for computational protein annotation. The definition of superfamilies, based on conserved folds, or of families, showing more recent homology signatures, allows a first categorization of the sequence space. However, for precise functional annotation or the identification of the unexplored parts within a family, a division into subfamilies is essential. As curators of an expert database, the Carbohydrate Active Enzymes database (CAZy), we began, more than 15 years ago, to manually define subfamilies based on phylogeny reconstruction. However, facing the increasing amount of sequence and functional data, we required more scalable and reproducible methods. The recently popularized sequence similarity networks (SSNs) make it possible to cope with very large families and to compute many subfamily schemes. Still, the choice of the optimal SSN subfamily scheme has so far relied only on expert knowledge, without any data-driven guidance from within the network. In this study, we therefore investigated several network properties to determine a criterion that curators can use to evaluate the quality of subfamily assignments. The closeness centrality criterion, a network property that indicates the connectedness within the network, yields results highly similar to the decisions of expert curators on eight distinct protein families. Closeness centrality also suggests that in some cases multiple levels of subfamilies are possible, depending on the granularity of the research question, and it indicates when no subfamily has emerged during the evolution of a family. Finally, we used closeness centrality to create subfamilies in four families of the CAZy database, providing a finer functional annotation and highlighting subfamilies without biochemically characterized members for potential future discoveries.
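To make the idea of scoring an SSN by closeness centrality concrete, here is a minimal sketch. It is not the authors' pipeline: the toy similarity scores, the function names (`build_ssn`, `closeness_centrality`, `mean_closeness`), the use of a plain score threshold, and the choice to compute closeness within each connected component and then average it over all nodes are all assumptions made for illustration.

```python
from collections import deque


def build_ssn(scores, threshold):
    """Build an undirected network, keeping edges with similarity >= threshold.

    `scores` maps (seq_a, seq_b) pairs to a similarity score (hypothetical data).
    """
    graph = {}
    for (a, b), score in scores.items():
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if score >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def closeness_centrality(graph, node):
    """Closeness of `node` inside its own connected component.

    Defined here as (number of reachable nodes) / (sum of shortest-path
    distances to them); isolated nodes get 0.0. This normalization is an
    assumption, not necessarily the one used in the paper.
    """
    dist = {node: 0}
    queue = deque([node])
    while queue:  # breadth-first search over unweighted edges
        current = queue.popleft()
        for neighbor in graph[current]:
            if neighbor not in dist:
                dist[neighbor] = dist[current] + 1
                queue.append(neighbor)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total > 0 else 0.0


def mean_closeness(graph):
    """Average closeness over all nodes: one candidate score for a threshold."""
    return sum(closeness_centrality(graph, n) for n in graph) / len(graph)


# Toy family: two tight groups (A, B, C and D, E, F) joined by one weak link.
scores = {
    ("A", "B"): 90, ("A", "C"): 85, ("B", "C"): 88,
    ("C", "D"): 40,
    ("D", "E"): 92, ("D", "F"): 87, ("E", "F"): 91,
}

for threshold in (30, 80):
    network = build_ssn(scores, threshold)
    print(threshold, round(mean_closeness(network), 3))
```

In this toy case the stricter threshold removes the weak C-D link, the network splits into two fully connected clusters, and the mean closeness rises. That is the kind of signal a curator could compare across candidate subfamily schemes.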



Source journal

PLoS Computational Biology (Biology: Biochemical Research Methods)

CiteScore: 7.10

Self-citation rate: 4.70%

Articles published: 820

About the journal: PLOS Computational Biology features works of exceptional significance that further our understanding of living systems at all scales, from molecules and cells to patient populations and ecosystems, through the application of computational methods. Readers include life and computational scientists, who can take the important findings presented here to the next level of discovery. Research articles must be declared as belonging to a relevant section. More information about the sections can be found in the submission guidelines. Research articles should model aspects of biological systems, demonstrate both methodological and scientific novelty, and provide profound new biological insights. Generally, reliability and significance of biological discovery through computation should be validated and enriched by experimental studies. Inclusion of experimental validation is not required for publication, but should be referenced where possible. Inclusion of experimental validation of a modest biological discovery through computation does not render a manuscript suitable for PLOS Computational Biology. Research articles specifically designated as Methods papers should describe outstanding methods of exceptional importance that have been shown, or have the promise, to provide new biological insights. The method must already be widely adopted, or have the promise of wide adoption by a broad community of users. Enhancements to existing published methods will only be considered if those enhancements bring exceptional new capabilities.
