Optimized Cluster-Enabled HMMER Searches
Authors: J. Walters, Joseph I. Landman, V. Chaudhary
In: Grid Computing for Bioinformatics and Computational Biology
Published: 2007-04-30
DOI: https://doi.org/10.1002/9780470191637.CH3
Citations: 2
Abstract
Protein sequence analysis tools to predict homology, structure and function of particular peptide sequences exist in abundance. One of the most commonly used tools is the profile hidden Markov model algorithm developed by Eddy [Eddy, 1998] and coworkers [Durbin et al., 1998]. These tools allow scientists to construct mathematical models (hidden Markov models, or HMMs) of a set of aligned protein sequences with known similar function and homology, which can then be applied to a large database of proteins. The tools generate a log-odds score indicating whether a protein is more likely to belong to the same family as the proteins that generated the HMM or to a set of random, unrelated sequences. Because of the complexity of the calculation, and because many HMMs may be applied to a single sequence (a Pfam search), these calculations require a significant number of processing cycles. Efforts to accelerate these searches have resulted in several platform- and hardware-specific variants, including an AltiVec port by Lindahl [Lindahl, 2005], a GPU port of hmmsearch by Horn et al. of Stanford [Horn et al., 2005], and several optimizations performed by the authors of this chapter. These optimizations range from minimal source code changes with some impact on performance to recasting the core algorithms in terms of a different computing technology, thus fundamentally altering the calculation. Each approach has specific benefits and costs. Detailed descriptions of the authors' modifications can also be found in [Walters et al., 2006, Landman et al., 2006]. The remainder of this chapter is organized as follows: in section 1.2 we give a brief overview of HMMER and the underlying Plan 7 architecture. In section 1.3 we discuss several different strategies that have been used to implement and accelerate HMMER on a variety of platforms. In section 1.4 we detail our optimizations and provide performance details. We conclude this chapter in section 1.5.
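The log-odds score mentioned in the abstract can be illustrated with a deliberately simplified sketch. It assumes an ungapped profile (match states only, no insert or delete states), a toy four-letter alphabet and a uniform background distribution. The names `PROFILE`, `BACKGROUND` and `log_odds_score` are hypothetical, and this is not HMMER's Plan 7 scoring code, only the underlying idea of comparing a family model against a random-sequence model.

```python
import math

# Hypothetical match-state emission probabilities for a 3-column profile
# over a toy alphabet (a real protein profile covers 20 amino acids).
PROFILE = [
    {"A": 0.7, "C": 0.1, "D": 0.1, "E": 0.1},
    {"A": 0.1, "C": 0.7, "D": 0.1, "E": 0.1},
    {"A": 0.1, "C": 0.1, "D": 0.7, "E": 0.1},
]

# Assumed null model: every residue is equally likely in a random sequence.
BACKGROUND = {"A": 0.25, "C": 0.25, "D": 0.25, "E": 0.25}


def log_odds_score(seq, profile=PROFILE, background=BACKGROUND):
    """Return the log2-odds (in bits) of seq under the profile vs. the null model."""
    if len(seq) != len(profile):
        raise ValueError("this ungapped sketch needs len(seq) == len(profile)")
    score = 0.0
    for residue, column in zip(seq, profile):
        # Each column adds log2(P(residue | family) / P(residue | random)).
        score += math.log2(column[residue] / background[residue])
    return score


if __name__ == "__main__":
    for candidate in ("ACD", "EEE"):
        print(candidate, round(log_odds_score(candidate), 3))
```

A positive score means the sequence is better explained by the family model than by the random model, and a negative score means the opposite. A full profile HMM search additionally handles insertions and deletions through dynamic programming over the model's states, which is where most of the processing cycles go.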