How representative are the known structures of the proteins in a complete genome? A comprehensive structural census

Folding & design. Pub Date: 1998-11-01. DOI: 10.1016/S1359-0278(98)00066-2

Mark Gerstein

Folding & design, volume 3, issue 6, pages 497-512.

Citations: 129

Abstract

Background: Determining how representative the known structures are of the proteins encoded by a complete genome is important for assessing to what extent our current picture of protein stability and folding is overly influenced by biases in the structure databank (PDB). It is also important for improving database-based methods of structure prediction and genome annotation.

Results: The known structures are compared to the proteins encoded by eight complete microbial genomes in terms of simple statistics such as sequence length, composition and secondary structure. The known structures are represented by a collection of nonhomologous domains from the PDB and a smaller list of ‘biophysical proteins’ on which folding experiments have concentrated. The proteins encoded by the genomes are considered as a whole and divided into various regions, such as known-structure homologue, low complexity (nonglobular), transmembrane or linker. Various tests are performed to assess the significance of the reported differences, in both a practical and a statistical sense.
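The composition and length comparison described in the Results can be illustrated with a minimal Python sketch. It is not the paper's procedure: the function names (`composition`, `composition_difference`) and the toy sequences are hypothetical, chosen only to show how per-residue fractions and mean lengths for two sets of sequences might be tabulated and subtracted.

```python
from collections import Counter
from statistics import mean

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(sequences):
    """Return the fraction of each standard residue pooled over all sequences."""
    counts = Counter()
    for seq in sequences:
        # Ignore any non-standard symbols (e.g. X, B, Z, gaps).
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    return {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}

def composition_difference(set_a, set_b):
    """Per-residue fraction in set_a minus the fraction in set_b."""
    comp_a, comp_b = composition(set_a), composition(set_b)
    return {aa: comp_a[aa] - comp_b[aa] for aa in AMINO_ACIDS}

# Toy, made-up sequences (assumption: stand-ins for genome and PDB sets).
genome_like = ["MKKLINNQKI", "MKIQNLKKNI"]
pdb_like = ["MCWAGCWDEG", "ACWGDECWAG"]

diff = composition_difference(genome_like, pdb_like)
mean_len_genome = mean(len(s) for s in genome_like)
mean_len_pdb = mean(len(s) for s in pdb_like)

# List residues from most enriched to most depleted in the first set.
for aa, d in sorted(diff.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{aa}: {d:+.3f}")
```

A positive entry in `diff` means the residue is more frequent in the first set. Assessing whether such a difference is significant, as the abstract mentions, would need a separate statistical test that this sketch does not attempt.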

Conclusions: The proteins encoded by the genomes are significantly different from those in the PDB. Their sequence lengths, which follow an extreme value distribution, are longer than the PDB proteins and much longer than the biophysical proteins. Their composition differs from the PDB proteins in having more Lys, Ile, Asn and Gln and less Cys and Trp. This is true overall and especially for the regions corresponding to soluble proteins of as yet unknown fold. Secondary-structure prediction on these uncharacterized regions indicates that they contain on average more helical structure than the PDB; differences about this mean are small, with yeast having slightly more sheet structure and Haemophilus influenzae and Helicobacter pylori more helical structure. Further information is available through the GeneCensus system at http://bioinfo.mbb.yale.edu/genome.
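The statement that sequence lengths follow an extreme value distribution can be made concrete with a small sketch. The abstract does not say which extreme value form or fitting method was used, so the following assumes a Gumbel (type I) distribution fitted by the method of moments. The function names and the list of lengths are hypothetical illustrations, not data from the paper.

```python
import math
from statistics import mean, stdev

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def fit_gumbel_moments(lengths):
    """Method-of-moments fit of a Gumbel distribution (assumed form).

    Uses mean = mu + gamma * beta and variance = (pi * beta)**2 / 6.
    Returns (mu, beta): the location and scale parameters.
    """
    m = mean(lengths)
    s = stdev(lengths)
    beta = s * math.sqrt(6) / math.pi
    mu = m - EULER_GAMMA * beta
    return mu, beta

def gumbel_cdf(x, mu, beta):
    """Cumulative distribution function of the Gumbel distribution."""
    return math.exp(-math.exp(-(x - mu) / beta))

# Made-up sequence lengths in residues (assumption: illustration only).
lengths = [120, 180, 210, 250, 290, 330, 400, 480, 620, 900]
mu, beta = fit_gumbel_moments(lengths)

print(f"location mu = {mu:.1f}, scale beta = {beta:.1f}")
print(f"fitted P(length <= 500) = {gumbel_cdf(500, mu, beta):.3f}")
```

The method of moments is used here only because it needs nothing beyond the standard library. A maximum-likelihood fit would be a reasonable alternative for real data.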
