Improved Parallel and Sequential Walking Tree Methods for Biological String Alignments

ACM/IEEE SC 1999 Conference (SC'99) DOI: 10.1145/331532.331583

P. Cull, Tai Hsu


Citations: 7

Abstract

Approximate string matching is commonly used to align genetic sequences (DNA or RNA) to determine their shared characteristics. Most genetic string matching methods are based on the edit-distance model, which does not provide alignments for inversions and translocations. Recently, a heuristic called the Walking Tree Method [2, 3] has been developed to solve this problem. Unlike other heuristics, it can handle more than one level of inversion, i.e., inversions within inversions. Furthermore, it tends to capture the matched strings' genes while other heuristics fail. There are two versions of the original walking tree heuristic: the score version gives only the alignment score, while the alignment version gives both the score and the alignment mapping between the strings. The score version runs in quadratic time and uses linear space, while the alignment version uses an extra log factor for time and space. In this paper, we briefly describe the walking tree method and the original sequential and parallel algorithms. We explain why a network of workstations needs different parallel algorithms from the original algorithm, which worked well on a symmetric multi-processor. Our improved parallel method also led to a quadratic-time sequential algorithm that uses less space. We give an example of our parallel method. We describe several experiments that show speedup linear in the number of processors, but an eventual drop-off in speedup as the communication network saturates. For big enough strings, we found linear speedup for all processors we had available. These results suggest that our improved parallel method will scale up as both the size of the problem and the number of processors increase. We include two figures that use real biological data and show that the walking tree methods can find translocations and inversions in DNA sequences and also discover unknown genes.
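To make the abstract's remark about the edit-distance model concrete, the sketch below computes a score-only edit distance in quadratic time and linear space, then applies it to a string with one inverted segment. This is the standard Levenshtein recurrence, not the Walking Tree Method, whose details the abstract does not give. The names `edit_distance` and `invert`, and the sample string, are hypothetical, and the inversion is modelled as a plain reversal of a substring (biological inversions usually involve the reverse complement, which this toy example ignores).

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: O(len(a) * len(b)) time, O(len(b)) space.

    Only the previous row of the dynamic-programming table is kept,
    so the score is available but the alignment itself is not.
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def invert(s: str, start: int, end: int) -> str:
    """Return s with the segment s[start:end] reversed (a toy inversion)."""
    return s[:start] + s[start:end][::-1] + s[end:]


# Hypothetical sample data, chosen only for illustration.
original = "ACGTTGCAAGGCTA"
inverted = invert(original, 4, 10)

print(edit_distance(original, original))  # identical strings score 0
print(edit_distance(original, inverted))  # the inverted segment is charged as edits
```

Under this model the inverted segment is scored as a run of ordinary substitutions, insertions, and deletions, so the result gives no indication that the two strings share a reversed block. That is the gap the abstract says the walking tree heuristic is meant to address.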


