Efficient exact algorithms for LDD motif search

IEEE ... International Conference on Computational Advances in Bio and Medical Sciences : [proceedings]. IEEE International Conference on Computational Advances in Bio and Medical Sciences Pub Date : 2017-10-01 DOI:10.1109/ICCABS.2017.8114294

Peng Xiao, S. Rajasekaran


Citations: 1

Abstract

Motifs are crucial patterns with numerous applications, including the identification of transcription factors and their binding sites, composite regulatory patterns, and similarity between families of proteins. Several motif models have been proposed in the literature. The (l, d)-motif model is one of these and has been studied widely. In this model, the input consists of n sequences, each of length m, and two integers l and d. The (l, d)-motif search (LDMS) problem is to identify all the strings of length l (called (l, d)-motifs) that occur in each of the sequences within a Hamming distance of d. However, this requirement might be unnecessarily stringent. We interpret a motif as a biologically significant entity that is evolutionarily preserved (within some distance). It may be highly improbable that the motif undergoes the same number of changes in each of the species. If d is the maximum number of changes that have occurred in a motif, then it is very likely that the number of mutations in one or more of the species is (possibly much) less than d. To account for this possibility, we introduce a new motif model in this paper. This model is called the (l, d1, d2)-motif model and is defined as follows. The input consists of n sequences, each of length m, and three integers l, d1, and d2, where d2 < d1. The (l, d1, d2)-motif search (LDDMS) problem is to identify all the strings M of length l (called (l, d1, d2)-motifs) such that M occurs in all the input sequences within a Hamming distance of d1 and occurs in at least one of the input sequences within a Hamming distance of d2. This model is more general than the (l, d)-motif model, and hence the LDDMS problem is NP-hard as well.
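To make the (l, d1, d2) definition concrete, the following is a minimal brute-force sketch in Python. It is not the exact algorithm from the paper, which the abstract does not describe; it simply enumerates every string of length l over an assumed DNA alphabet and tests the two distance conditions directly. The function names (`hamming`, `min_distance`, `ldd_motifs`) and the default alphabet are illustrative assumptions, and the running time is exponential in l, so it is only practical for tiny inputs.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(motif, seq):
    """Smallest Hamming distance between motif and any l-mer of seq.

    Assumes len(seq) >= len(motif).
    """
    l = len(motif)
    return min(hamming(motif, seq[i:i + l]) for i in range(len(seq) - l + 1))


def ldd_motifs(seqs, l, d1, d2, alphabet="ACGT"):
    """Brute-force enumeration of (l, d1, d2)-motifs (illustrative only).

    A candidate M qualifies if it occurs within distance d1 in every
    sequence and within distance d2 in at least one sequence.
    """
    if not 0 <= d2 < d1:
        raise ValueError("the model requires 0 <= d2 < d1")
    motifs = []
    for candidate in product(alphabet, repeat=l):
        m = "".join(candidate)
        dists = [min_distance(m, s) for s in seqs]
        # max(dists) <= d1: within d1 of all sequences
        # min(dists) <= d2: within d2 of at least one sequence
        if max(dists) <= d1 and min(dists) <= d2:
            motifs.append(m)
    return motifs


if __name__ == "__main__":
    # Toy input chosen for illustration; not data from the paper.
    sequences = ["ACGT", "ACCT", "AGGT"]
    print(ldd_motifs(sequences, l=3, d1=1, d2=0))
```

With d2 = 0, every reported motif must appear verbatim in at least one input sequence while still matching all the others within d1 mismatches, which is the extra constraint that separates this model from the plain (l, d) model.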


