How representative are the known structures of the proteins in a complete genome? A comprehensive structural census
Mark Gerstein
Folding & Design, volume 3, issue 6, pages 497-512, published 1998-11-01
DOI: 10.1016/S1359-0278(98)00066-2
Citation count: 129
Abstract
Background: Determining how representative the known structures are of the proteins encoded by a complete genome is important for assessing to what extent our current picture of protein stability and folding is overly influenced by biases in the structure databank (PDB). It is also important for improving database-based methods of structure prediction and genome annotation.
Results: The known structures are compared to the proteins encoded by eight complete microbial genomes in terms of simple statistics such as sequence length, composition and secondary structure. The known structures are represented by a collection of nonhomologous domains from the PDB and a smaller list of ‘biophysical proteins’ on which folding experiments have concentrated. The proteins encoded by the genomes are considered as a whole and divided into various regions, such as known-structure homologue, low complexity (nonglobular), transmembrane or linker. Various tests are performed to assess the significance of the reported differences, in both a practical and a statistical sense.
Conclusions: The proteins encoded by the genomes are significantly different from those in the PDB. Their sequence lengths, which follow an extreme value distribution, are longer than those of the PDB proteins and much longer than those of the biophysical proteins. Their composition differs from that of the PDB proteins in having more Lys, Ile, Asn and Gln and less Cys and Trp. This is true overall and especially for the regions corresponding to soluble proteins of as yet unknown fold. Secondary-structure prediction on these uncharacterized regions indicates that they contain on average more helical structure than the PDB; differences about this mean are small, with yeast having slightly more sheet structure and Haemophilus influenzae and Helicobacter pylori more helical structure. Further information is available through the GeneCensus system at http://bioinfo.mbb.yale.edu/genome.
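The kind of simple statistics compared in the abstract (sequence length and amino-acid composition) can be illustrated with a minimal Python sketch. It is not the paper's procedure: the function names and the toy sequences are hypothetical and chosen only to show the arithmetic, and no significance testing is included.

```python
from collections import Counter

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seqs):
    """Fraction of each standard residue, pooled over all sequences."""
    counts = Counter()
    for s in seqs:
        # Ignore anything that is not one of the 20 standard residues.
        counts.update(c for c in s.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard residues found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def composition_difference(genome_seqs, pdb_seqs):
    """Per-residue difference in fraction: genome set minus PDB set."""
    g = composition(genome_seqs)
    p = composition(pdb_seqs)
    return {aa: g[aa] - p[aa] for aa in AMINO_ACIDS}


def mean_length(seqs):
    """Average sequence length of a set of sequences."""
    return sum(len(s) for s in seqs) / len(seqs)


# Toy sequences (invented for illustration, not real proteins).
genome_toy = ["MKKILNNQ", "MKIQNLKK"]
pdb_toy = ["MCWACGW", "ACWGCA"]

diff = composition_difference(genome_toy, pdb_toy)
enriched = sorted(aa for aa, d in diff.items() if d > 0)
depleted = sorted(aa for aa, d in diff.items() if d < 0)
print("enriched in genome set:", enriched)
print("depleted in genome set:", depleted)
print("mean lengths:", mean_length(genome_toy), mean_length(pdb_toy))
```

On real data, the two inputs would be the genome-encoded sequences (or a region class such as soluble proteins of unknown fold) and the nonhomologous PDB domain set, and the differences would need the practical and statistical significance checks the abstract mentions.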