Mai Alzamel, Lorraine A. K. Ayad, G. Bernardini, R. Grossi, C. Iliopoulos, N. Pisanti, S. Pissis, Giovanna Rosone
Degenerate String Comparison and Applications
Workshop on Algorithms in Bioinformatics, 2018-08-01
DOI: https://doi.org/10.4230/LIPIcs.WABI.2018.21
Citations: 13
Abstract
A generalised degenerate string (GD string) S^ is a sequence of n sets of strings of total size N, where the ith set contains strings of the same length k_i but this length can vary between different sets. We denote the sum of these lengths k_0, k_1,...,k_{n-1} by W. This type of uncertain sequence can represent, for example, a gapless multiple sequence alignment of width W in a compact form. Our first result in this paper is an O(N+M)-time algorithm for deciding whether the intersection of two GD strings of total sizes N and M, respectively, over an integer alphabet, is non-empty. This result is based on a combinatorial result of independent interest: although the intersection of two GD strings can be exponential in the total size of the two strings, it can be represented in only linear space. A similar result can be obtained by employing an automata-based approach but its cost is alphabet-dependent. We then apply our string comparison algorithm to compute palindromes in GD strings. We present an O(min{W,n^2}N)-time algorithm for computing all palindromes in S^. Furthermore, we show a similar conditional lower bound for computing maximal palindromes in S^. Finally, proof-of-concept experimental results are presented using real protein datasets.
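To make the definitions concrete, here is a minimal Python sketch of a GD string and of the intersection test. It is not the O(N+M)-time algorithm from the paper. It is a simple breadth-first search over states, written under the assumption that a GD string can be modelled as a list of sets of equal-length strings. The names `gd_language` and `gd_intersects` and the sample data are hypothetical and chosen only for illustration. The search tracks how far one GD string is "ahead" of the other, so it visits polynomially many states even though the language itself, expanded by `gd_language`, can be exponentially large.

```python
from collections import deque
from itertools import product


def gd_language(gd):
    """Expand a GD string into the full set of strings it spells.

    Exponential in general; only useful for checking tiny examples.
    """
    return {"".join(choice) for choice in product(*gd)}


def gd_intersects(S, T):
    """Return True if some string is spelled by both GD strings S and T.

    A state (i, j, ahead, leader) means: i sets of S and j sets of T have
    been consumed, the spelled prefixes agree, and side `leader`
    (0 for S, 1 for T) has spelled the extra text `ahead`.
    """
    sides = (S, T)
    start = (0, 0, "", 0)
    seen = {start}
    queue = deque([start])
    while queue:
        i, j, ahead, leader = queue.popleft()
        if i == len(S) and j == len(T):
            if not ahead:
                return True  # both fully consumed with nothing left over
            continue
        pos = [i, j]
        follower = 1 - leader
        candidates = []
        if pos[follower] == len(sides[follower]):
            # Follower has no sets left: on a tie, swap roles so the
            # other side can move; otherwise this is a dead end.
            if not ahead:
                candidates.append((i, j, "", follower))
        else:
            new_pos = list(pos)
            new_pos[follower] += 1
            for x in sides[follower][pos[follower]]:
                if ahead.startswith(x):
                    # Follower catches up partially (or exactly).
                    candidates.append(
                        (new_pos[0], new_pos[1], ahead[len(x):], leader))
                elif x.startswith(ahead):
                    # Follower overtakes and becomes the new leader.
                    candidates.append(
                        (new_pos[0], new_pos[1], x[len(ahead):], follower))
        for state in candidates:
            if state not in seen:
                seen.add(state)
                queue.append(state)
    return False


# Hypothetical example: both GD strings have total width W = 3.
S = [{"AC", "GT"}, {"A"}]
T = [{"A", "G"}, {"CA", "TT"}]
print(gd_intersects(S, T))
print(gd_language(S) & gd_language(T))
```

In this example the two GD strings share exactly one string, `ACA`, so the test succeeds. The state-based search was chosen over comparing the expanded languages because the latter grows exponentially with the number of sets.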