{"title":"Efficient exact algorithms for LDD motif search","authors":"Peng Xiao, S. Rajasekaran","doi":"10.1109/ICCABS.2017.8114294","DOIUrl":null,"url":null,"abstract":"Motifs are crucial patterns that have numerous applications including the identification of transcription factors and their binding sites, composite regulatory patterns, similarity between families of proteins, etc. Several motif models have been proposed in the literature. The (1, d)-motif model is one of these that has been studied widely. In this model, there are n input sequences and each has a length of m. Input are also two integers I and d. The (I, d)-motif search (LDMS) problem is to identify all the strings (called (I, d)-motifs) of length 1 that occur in each of the sequences within a hamming distance of d. However, this requirement might be unnecessarily stringent. We interpret a motif as a biologically significant entity that is evolutionarily preserved (within some distance). It may be highly improbable that the motif undergoes the same number of changes in each of the species. If d is the maximum number of changes that have occurred in a motif, then it is very likely that the number of mutations in one or more of the species is (possibly much) less than d. To account for this possibility we introduce a new model of motif in this paper. This model is called the (l, d 1 , d 2 )-motif model and is defined as follows. Input are n sequences each of length m and three integers l, d 1 , and d 2 , where d2 1 . The (l, d 1 , d 2 )-motif search (LDDMS) problem is to identify all the strings M (called (l, d 1 , d 2 ) motifs) of length l each such that M occurs in all the input sequences within a hamming distance of d1 and it occurs in at least one of the input sequences within a hamming distance of d 2 . This model is more general than the (l, d)-motif model and hence is NP-hard as well.","PeriodicalId":89933,"journal":{"name":"IEEE ... 
International Conference on Computational Advances in Bio and Medical Sciences : [proceedings]. IEEE International Conference on Computational Advances in Bio and Medical Sciences","volume":"14 5 1","pages":"1"},"PeriodicalIF":0.0000,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE ... International Conference on Computational Advances in Bio and Medical Sciences : [proceedings]. IEEE International Conference on Computational Advances in Bio and Medical Sciences","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCABS.2017.8114294","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1
Abstract
Motifs are crucial patterns with numerous applications, including the identification of transcription factors and their binding sites, composite regulatory patterns, and similarity between families of proteins. Several motif models have been proposed in the literature. The (l, d)-motif model is one of these and has been studied widely. In this model, the input consists of n sequences, each of length m, and two integers l and d. The (l, d)-motif search (LDMS) problem is to identify all the strings of length l (called (l, d)-motifs) that occur in each of the sequences within a Hamming distance of d. However, this requirement might be unnecessarily stringent. We interpret a motif as a biologically significant entity that is evolutionarily preserved (within some distance). It may be highly improbable that the motif undergoes the same number of changes in each of the species. If d is the maximum number of changes that have occurred in a motif, then it is very likely that the number of mutations in one or more of the species is (possibly much) less than d. To account for this possibility, we introduce a new motif model in this paper. This model is called the (l, d1, d2)-motif model and is defined as follows. The input consists of n sequences, each of length m, and three integers l, d1, and d2, where d2 ≤ d1. The (l, d1, d2)-motif search (LDDMS) problem is to identify all the strings M of length l (called (l, d1, d2)-motifs) such that M occurs in all the input sequences within a Hamming distance of d1 and occurs in at least one of the input sequences within a Hamming distance of d2. This model is more general than the (l, d)-motif model, and hence the LDDMS problem is NP-hard as well.
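To make the LDDMS definition concrete, the following is a minimal brute-force sketch in Python. It is not the exact algorithm of the paper, which the abstract does not describe; it simply enumerates every candidate string and checks the two distance conditions. The function names (`hamming`, `min_distance`, `lddms_bruteforce`) and the default DNA alphabet are assumptions made for illustration.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(motif, seq):
    """Smallest Hamming distance between motif and any length-l window of seq.

    Assumes len(seq) >= len(motif), so at least one window exists.
    """
    l = len(motif)
    return min(hamming(motif, seq[i:i + l]) for i in range(len(seq) - l + 1))


def lddms_bruteforce(seqs, l, d1, d2, alphabet="ACGT"):
    """Return all (l, d1, d2)-motifs of seqs by exhaustive enumeration.

    A candidate M qualifies if it occurs in every sequence within Hamming
    distance d1 and in at least one sequence within Hamming distance d2.
    The alphabet default is an illustrative assumption (DNA).
    """
    motifs = []
    for cand in product(alphabet, repeat=l):
        m = "".join(cand)
        dists = [min_distance(m, s) for s in seqs]
        # max(dists) <= d1: within d1 of all sequences.
        # min(dists) <= d2: within d2 of at least one sequence.
        if max(dists) <= d1 and min(dists) <= d2:
            motifs.append(m)
    return motifs
```

Setting d2 equal to d1 makes the second condition redundant, so the sketch then returns the ordinary (l, d)-motifs, which illustrates why the new model generalizes the old one. The enumeration visits every string of length l over the alphabet, so its cost grows exponentially in l; it is useful only for checking small instances.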