An Efficient Algorithm for String Motif Discovery

Proceedings of the ... Asia-Pacific bioinformatics conference. Pub Date: 2005-12-01. DOI: 10.1142/9781860947292_0011

Francis Y. L. Chin, Henry C. M. Leung

Pages: 79-88

Citations: 18

Abstract

Finding common patterns, or motifs, in a set of DNA sequences is an important problem in bioinformatics. One common representation of a motif is a string over the symbols A, C, G, T and N, where N stands for the wildcard symbol. In this paper, we introduce a more general motif discovery problem that avoids the weaknesses of the Planted (l,d)-Motif Problem and takes a set of control sequences as an additional input. Existing brute-force algorithms for similar problems take O(n(t+f)l5^l) time, where t and f are the numbers of input sequences and control sequences respectively, n is the length of each sequence, and l is the length of the motif. We propose an efficient algorithm, called VAS, which has an expected running time of O(nfl(nt)(4+1/4)) using O((nt)(4+1/4)) space for any integer k. In particular, when k = 3, the time and space complexities are O(nlf (nt)(1.0625)) and O((nt)(1.0625)) respectively. This algorithm uses voting and a graph representation to achieve better time and space complexity. The technique can also be used to improve the performance of some existing algorithms.
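To make the motif representation and the brute-force baseline concrete, here is a minimal Python sketch. It is not the VAS algorithm. The abstract does not state the exact selection criterion, so the sketch assumes a motif qualifies when it occurs in every input sequence and in no control sequence. The function names `matches_at`, `occurs_in` and `brute_force_motifs` are hypothetical and chosen only for illustration.

```python
from itertools import product

ALPHABET = "ACGTN"  # N is the wildcard symbol


def matches_at(motif, seq, pos):
    """Return True if motif matches seq at offset pos (N matches any symbol)."""
    if pos < 0 or pos + len(motif) > len(seq):
        return False
    return all(m == "N" or m == s for m, s in zip(motif, seq[pos:pos + len(motif)]))


def occurs_in(motif, seq):
    """Return True if motif matches seq at any offset."""
    return any(matches_at(motif, seq, i) for i in range(len(seq) - len(motif) + 1))


def brute_force_motifs(inputs, controls, l):
    """Enumerate all 5**l candidate strings of length l.

    Assumed criterion (not taken from the paper): keep a candidate if it
    occurs in every input sequence and in none of the control sequences.
    """
    found = []
    for cand in product(ALPHABET, repeat=l):
        motif = "".join(cand)
        in_all_inputs = all(occurs_in(motif, s) for s in inputs)
        in_any_control = any(occurs_in(motif, c) for c in controls)
        if in_all_inputs and not in_any_control:
            found.append(motif)
    return found
```

For example, `brute_force_motifs(["ACG", "AGG"], ["TTT"], 3)` scans all 125 candidates of length 3. The enumeration over five symbols at each of the l positions is what makes this approach exponential in the motif length.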

