An Efficient Parallel Algorithm for the Multiple Longest Common Subsequence (MLCS) Problem

2008 37th International Conference on Parallel Processing. Pub Date: 2008-09-09. DOI: 10.1109/ICPP.2008.79

Dmitry Korkin, Qingguo Wang, Yi Shang


Citations: 26

Abstract

Finding the multiple longest common subsequence (MLCS) is an important problem in bioinformatics and computational genomics. Approaches more efficient than the standard dynamic programming method have been introduced and successfully parallelized for the special case of two sequences. However, the increasing complexity and size of biological data require an efficient method that is applicable to an arbitrary number of sequences, together with an efficient parallelization of that method. A recently developed dominant points method for the general MLCS problem has shown a significant performance improvement over the dynamic programming method when the number of sequences is larger than two. At the same time, the approach has revealed a strong demand for parallelization, so that it can be applied to larger families of sequences or to longer sequences. In this paper, we introduce an efficient parallel algorithm, based on the dominant points method, for finding an MLCS of an arbitrary number of sequences. When the number of processors is not greater than the size of the alphabet multiplied by the number of sequences, the parallel algorithm is estimated to have asymptotically linear speedup. We experimentally tested the algorithm using sets of randomly generated sequences over different alphabets, as well as protein sequences from a family of homologous proteins. We found that the performance of the algorithm increases with the number of input sequences and reaches near-linear speedup for eight sequences.
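To make the dominant points idea concrete, the following is a minimal sequential sketch in Python. It is not the parallel algorithm of the paper, whose details the abstract does not give. It only illustrates the general technique under stated assumptions: a match point is a tuple of positions, one per sequence, that all hold the same character; a point is dropped from a level when another point of that level is no larger in every coordinate; and the number of non-empty levels equals the MLCS length. The names `build_successor_tables`, `dominates`, `minima`, and `mlcs` are hypothetical and chosen for illustration.

```python
def build_successor_tables(seqs, alphabet):
    """tables[s][c][i] = smallest j >= i with seqs[s][j] == c, else len(seqs[s])."""
    tables = []
    for seq in seqs:
        n = len(seq)
        table = {c: [n] * (n + 1) for c in alphabet}
        for i in range(n - 1, -1, -1):
            for c in alphabet:
                table[c][i] = i if seq[i] == c else table[c][i + 1]
        tables.append(table)
    return tables


def dominates(p, q):
    """p dominates q if p differs from q and is no larger in every coordinate."""
    return p != q and all(a <= b for a, b in zip(p, q))


def minima(points):
    """Keep only the points that no other point in the set dominates."""
    pts = list(points)
    return {q for q in pts if not any(dominates(p, q) for p in pts)}


def mlcs(seqs):
    """Return one longest common subsequence of a non-empty list of strings."""
    d = len(seqs)
    # Only characters present in every sequence can appear in the result.
    alphabet = sorted(set.intersection(*(set(s) for s in seqs)))
    tables = build_successor_tables(seqs, alphabet)
    lengths = [len(s) for s in seqs]

    start = tuple(-1 for _ in seqs)  # virtual point before all sequences
    parent = {start: None}
    frontier = {start}
    last = start

    while True:
        # Each frontier point yields at most one successor per character.
        # These computations are independent of one another (an observation
        # about this sketch, not a description of the paper's scheme).
        candidates = {}
        for p in frontier:
            for c in alphabet:
                q = tuple(tables[s][c][p[s] + 1] for s in range(d))
                if all(q[s] < lengths[s] for s in range(d)):
                    candidates.setdefault(q, p)
        if not candidates:
            break
        frontier = minima(candidates)
        for q in frontier:
            parent[q] = candidates[q]
        last = next(iter(frontier))

    # Walk parent links back from a point on the deepest level.
    chars = []
    p = last
    while parent[p] is not None:
        chars.append(seqs[0][p[0]])
        p = parent[p]
    return "".join(reversed(chars))
```

For example, `mlcs(["ABCBDAB", "BDCABA"])` returns a common subsequence of length 4. Several answers of that length exist, and which one is returned depends on set iteration order. The quadratic `minima` filter is chosen for brevity; a practical implementation would use a faster method for finding minimal points.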
