A sequence alignment-independent method for protein classification.

Applied Bioinformatics, 3(2-3): 137-48 | Pub Date: 2004-01-01 | DOI: 10.2165/00822942-200403020-00008

John K Vries, Rajan Munshi, Dror Tobi, Judith Klein-Seetharaman, Panayiotis V Benos, Ivet Bahar

Citations: 23

Abstract



A sequence alignment-independent method for protein classification.

Annotation of the rapidly accumulating body of sequence data relies heavily on the detection of remote homologues and functional motifs in protein families. The most popular methods rely on sequence alignment. These include programs that use a scoring matrix to compare the probability of a potential alignment with random chance and programs that use curated multiple alignments to train profile hidden Markov models (HMMs). Related approaches depend on bootstrapping multiple alignments from a single sequence. However, alignment-based programs have limitations. They make the assumption that contiguity is conserved between homologous segments, which may not be true in genetic recombination or horizontal transfer. Alignments also become ambiguous when sequence similarity drops below 40%. This has kindled interest in classification methods that do not rely on alignment. An approach to classification without alignment based on the distribution of contiguous sequences of four amino acids (4-grams) was developed. Interest in 4-grams stemmed from the observation that almost all theoretically possible 4-grams (20^4 = 160,000) occur in natural sequences and the majority of 4-grams are uniformly distributed. This implies that the probability of finding identical 4-grams by random chance in unrelated sequences is low. A Bayesian probabilistic model was developed to test this hypothesis. For each protein family in Pfam-A and PIR-PSD, a feature vector called a probe was constructed from the set of 4-grams that best characterised the family. In rigorous jackknife tests, unknown sequences from Pfam-A and PIR-PSD were compared with the probes for each family. A classification result was deemed a true positive if the probe match with the highest probability was in first place in a rank-ordered list. This was achieved in 70% of cases. Analysis of false positives suggested that the precision might approach 85% if selected families were clustered into subsets. Case studies indicated that the 4-grams in common between an unknown and the best matching probe correlated with functional motifs from PRINTS. The results showed that remote homologues and functional motifs could be identified from an analysis of 4-gram patterns.
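The abstract does not spell out the Bayesian model or how probes were selected, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a naive-Bayes-style log-odds weight with add-one smoothing over the 20^4 possible 4-grams; the function names, the `top_k` cutoff, and the toy sequences are all hypothetical and chosen purely for illustration.

```python
import math
from collections import Counter

VOCAB = 20 ** 4  # number of theoretically possible 4-grams (160,000)

def four_grams(seq, n=4):
    """Return all overlapping contiguous n-grams of a sequence."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

def build_probe(family_seqs, bg_counts, bg_total, top_k=50):
    """Hypothetical probe: the top_k 4-grams with the highest smoothed
    log-odds of occurring in the family versus the background."""
    fam = Counter(g for s in family_seqs for g in four_grams(s))
    fam_total = sum(fam.values())
    weights = {}
    for gram, count in fam.items():
        p_fam = (count + 1) / (fam_total + VOCAB)
        p_bg = (bg_counts.get(gram, 0) + 1) / (bg_total + VOCAB)
        weights[gram] = math.log(p_fam / p_bg)
    best = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return dict(best)

def score(seq, probe):
    """Sum the weights of the distinct 4-grams the sequence shares with a probe."""
    return sum(probe.get(g, 0.0) for g in set(four_grams(seq)))

def rank_families(seq, probes):
    """Rank family names by descending score; the first entry is the prediction."""
    return sorted(probes, key=lambda name: score(seq, probes[name]), reverse=True)

# Toy data (made up, not real Pfam-A or PIR-PSD families).
families = {
    "famA": ["MKTAYIAKQR", "MKTAYLAKQR"],
    "famB": ["GSHMLEDPVV", "GSHMLEDPAV"],
}
bg_counts = Counter(g for seqs in families.values() for s in seqs for g in four_grams(s))
bg_total = sum(bg_counts.values())
probes = {name: build_probe(seqs, bg_counts, bg_total) for name, seqs in families.items()}

# An "unknown" sequence counts as a true positive if its real family ranks first.
ranking = rank_families("AAMKTAYIAK", probes)
print(ranking)
```

In this sketch no alignment is computed at any point: classification depends only on which 4-grams the unknown sequence shares with each probe, which is the property the abstract emphasises. A jackknife evaluation in this setting would rebuild each probe with the test sequence held out before ranking.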
