Comparative n-gram analysis of whole-genome protein sequences

IEEE Personal Communications. Pub Date: 2002-03-24. DOI: 10.3115/1289189.1289259

M. Ganapathiraju, D. Weisser, Roni Rosenfeld, J. Carbonell, Raj Reddy, J. Klein-Seetharaman

Citations: 73

Abstract


Comparative n-gram analysis of whole-genome protein sequences

A current barrier for successful rational drug design is the lack of understanding of the structure space provided by the proteins in a cell that is determined by their sequence space. The protein sequences capable of folding to functional three-dimensional shapes of the proteins are clearly different for different organisms, since sequences obtained from human proteins often fail to form correct three-dimensional structures in bacterial organisms. In analogy to the question "What kind of things do people say?" we therefore need to ask the question "What kind of amino acid sequences occur in the proteins of an organism?" An understanding of the sequence space occupied by proteins in different organisms would have important applications for "translation" of proteins from the language of one organism into that of another and for the design of drugs that target sequences that might be unique to or preferred by pathogenic organisms over those in human hosts.

Here we describe the development of a biological language modeling toolkit (BLMT) for genome-wide statistical amino acid n-gram analysis and comparison across organisms (freely accessible at www.cs.cmu.edu/~blmt). Its functions were applied to 44 different bacterial, archaeal and the human genome. Amino acid n-gram distribution was found to be characteristic of organisms, as evidenced by (1) the ability of simple Markovian unigram models to distinguish organisms, (2) the marked variation in n-gram distributions across organisms above random variation, and (3) identification of organism-specific phrases in protein sequences that are greater than an order of magnitude standard deviations away from the mean. These lines of evidence suggest that different organisms utilize different "vocabularies" and "phrases", an observation that may provide novel approaches to drug development by specifically targeting these phrases. The results suggest that further detailed analysis of n-gram statistics of protein sequences from whole genomes will likely, in analogy to word n-gram analysis, result in powerful models for prediction, topic classification and information extraction of biological sequences.
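The three kinds of evidence listed in the abstract can be illustrated with a minimal Python sketch. This is not BLMT code and does not reproduce the authors' procedure: the function names, the add-alpha smoothing, the toy sequences, and the choice to compute z-scores across organisms are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def ngram_counts(sequences, n):
    """Count overlapping amino acid n-grams across a list of protein sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts

def unigram_model(sequences, alpha=1.0):
    """Add-alpha smoothed unigram probabilities (smoothing is an assumption here)."""
    counts = ngram_counts(sequences, 1)
    total = sum(counts[a] for a in AMINO_ACIDS) + alpha * len(AMINO_ACIDS)
    return {a: (counts[a] + alpha) / total for a in AMINO_ACIDS}

def log_likelihood(seq, model):
    """Sum of residue log-probabilities; non-standard residues are skipped."""
    return sum(math.log(model[a]) for a in seq if a in model)

def classify(seq, models):
    """Return the organism whose unigram model scores seq highest."""
    return max(models, key=lambda org: log_likelihood(seq, models[org]))

def zscores(ngram, counts_by_org):
    """z-score of an n-gram's relative frequency per organism,
    measured against the mean and standard deviation across organisms
    (one possible reading of 'standard deviations away from the mean')."""
    freqs = {org: c[ngram] / max(sum(c.values()), 1)
             for org, c in counts_by_org.items()}
    mean = sum(freqs.values()) / len(freqs)
    sd = math.sqrt(sum((f - mean) ** 2 for f in freqs.values()) / len(freqs))
    return {org: (f - mean) / sd if sd > 0 else 0.0 for org, f in freqs.items()}

# Hypothetical toy "proteomes"; real input would be whole-genome protein sets.
proteomes = {
    "org_a": ["AAAALAAK", "ALAAAKAA"],
    "org_b": ["WWCWHWWC", "CWWHWCWW"],
}
models = {org: unigram_model(seqs) for org, seqs in proteomes.items()}
bigrams = {org: ngram_counts(seqs, 2) for org, seqs in proteomes.items()}

print(classify("AALAK", models))
print(zscores("AA", bigrams))
```

With real data, the same counting step would run once per genome for each n, and n-grams with large absolute z-scores would be candidates for the organism-specific "phrases" the abstract mentions.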
