BLAST Tree: Fast Filtering for Genomic Sequence Classification

2010 IEEE International Conference on BioInformatics and BioEngineering. Pub Date: 2010-05-31. DOI: 10.1109/BIBE.2010.74

Stuart King, Yanni Sun, James R. Cole, S. Pramanik

Citations: 2

Abstract

With the advent of next-generation sequencing and culture-independent methods, we are now accumulating an enormous amount of metagenomic data from microbial communities. These data sets are large, hard to assemble, and might encode rare or novel proteins, posing new computational challenges for protein homology search. This paper presents a novel protein homology search algorithm that combines the salient features of pairwise sequence alignment programs such as BLAST and protein-family-based tools such as HMMER. It is optimized for protein annotation in metagenomic data sets because: 1) it is fast, 2) it can classify short protein fragments encoded by individual sequence reads, and 3) it can find homologs to novel or rare protein families when there are not enough member sequences to build a probabilistic model. Our algorithm builds a new indexing data structure called BlastTree, which can index a large sequence family database because of our effective compression techniques. In addition, BlastTree fully exploits sequence family membership information to improve homology search sensitivity. When the BlastTree Search algorithm is incorporated into HMMER as a filter, the search runs in a fraction of the time with comparable quality.
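The abstract does not describe the internals of BlastTree, so the following is only a generic sketch of family-aware seed filtering and should not be read as the authors' algorithm. It assumes a plain k-mer index in place of the tree and its compression scheme; the names `build_family_index` and `candidate_families`, the parameters `k` and `min_hits`, and the toy sequences are all hypothetical. The idea it illustrates is the one the abstract states: use family membership to decide cheaply which families a short read might belong to, and pass only those on to a slower profile search.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every overlapping k-mer of a protein sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_family_index(families, k=3):
    """Map each k-mer to the set of family ids that contain it.

    All members of a family share one entry per k-mer, so the index grows
    with the number of distinct (k-mer, family) pairs rather than with the
    total number of member sequences. This is an illustrative stand-in,
    not the BlastTree structure itself.
    """
    index = defaultdict(set)
    for family_id, members in families.items():
        for seq in members:
            for kmer in kmers(seq, k):
                index[kmer].add(family_id)
    return index


def candidate_families(query, index, k=3, min_hits=2):
    """Return {family_id: hit_count} for families that pass the filter.

    A hit is one query k-mer position whose k-mer occurs anywhere in the
    family. Families with fewer than min_hits hits are filtered out.
    """
    hits = defaultdict(int)
    for kmer in kmers(query, k):
        for family_id in index.get(kmer, ()):
            hits[family_id] += 1
    return {fam: n for fam, n in hits.items() if n >= min_hits}


# Toy data (hypothetical sequences, for illustration only).
families = {
    "famA": ["MKVLAAGIW", "MKVLSAGIW"],
    "famB": ["QQRTPLHHE"],
}
index = build_family_index(families)
result = candidate_families("KVLAAG", index)
```

In a pipeline of this shape, only the families returned by `candidate_families` would be handed to the expensive model-based scoring step, which is where a time saving could come from. Real filters would also need substitution-tolerant seeds and a principled threshold, neither of which is modeled here.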
