{"title":"MONI can find k-MEMs","authors":"T. Gagie","doi":"10.4230/LIPIcs.CPM.2023.26","DOIUrl":"https://doi.org/10.4230/LIPIcs.CPM.2023.26","url":null,"abstract":"Suppose we are asked to index a text $T [0..n - 1]$ such that, given a pattern $P [0..m - 1]$, we can quickly report the maximal substrings of $P$ that each occur in $T$ at least $k$ times. We first show how we can add $O (r \\log n)$ bits to Rossi et al.'s recent MONI index, where $r$ is the number of runs in the Burrows-Wheeler Transform of $T$, such that it supports such queries in $O (k m \\log n)$ time. We then show how, if we are given $k$ at construction time, we can reduce the query time to $O (m \\log n)$.","PeriodicalId":236737,"journal":{"name":"Annual Symposium on Combinatorial Pattern Matching","volume":"125 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115797106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
D. Fisman, Joshua Grogin, Oded Margalit, Gera Weiss
{"title":"The Normalized Edit Distance with Uniform Operation Costs is a Metric","authors":"D. Fisman, Joshua Grogin, Oded Margalit, Gera Weiss","doi":"10.4230/LIPIcs.CPM.2022.17","DOIUrl":"https://doi.org/10.4230/LIPIcs.CPM.2022.17","url":null,"abstract":"We prove that the normalized edit distance proposed in [Marzal and Vidal 1993] is a metric when the cost of all the edit operations are the same. This closes a long standing gap in the literature where several authors noted that this distance does not satisfy the triangle inequality in the general case, and that it was not known whether it is satisfied in the uniform case – where all the edit costs are equal. We compare this metric to two normalized metrics proposed as alternatives in the literature, when people thought that Marzal’s and Vidal’s distance is not a metric, and identify key properties that explain why the original distance, now known to also be a metric, is better for some applications. Our examination is from a point of view of formal verification, but the properties and their significance are stated in an application agnostic way.","PeriodicalId":236737,"journal":{"name":"Annual Symposium on Combinatorial Pattern Matching","volume":"107 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123238484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Arbitrary-length analogs to de Bruijn sequences","authors":"Abhinav Nellore, Rachel A. Ward","doi":"10.4230/LIPIcs.CPM.2022.9","DOIUrl":"https://doi.org/10.4230/LIPIcs.CPM.2022.9","url":null,"abstract":"Let $\\widetilde{\\alpha}$ be a length-$L$ cyclic sequence of characters from a size-$K$ alphabet $\\mathcal{A}$ such that the number of occurrences of any length-$m$ string on $\\mathcal{A}$ as a substring of $\\widetilde{\\alpha}$ is $\\lfloor L / K^m \\rfloor$ or $\\lceil L / K^m \\rceil$. When $L = K^N$ for any positive integer $N$, $\\widetilde{\\alpha}$ is a de Bruijn sequence of order $N$, and when $L \\neq K^N$, $\\widetilde{\\alpha}$ shares many properties with de Bruijn sequences. We describe an algorithm that outputs some $\\widetilde{\\alpha}$ for any combination of $K \\geq 2$ and $L \\geq 1$ in $O(L)$ time using $O(L \\log K)$ space. This algorithm extends Lempel's recursive construction of a binary de Bruijn sequence. An implementation written in Python is available at https://github.com/nelloreward/pkl.","PeriodicalId":236737,"journal":{"name":"Annual Symposium on Combinatorial Pattern Matching","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122336978","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Duncan Adamson, Argyrios Deligkas, V. Gusev, I. Potapov
{"title":"Ranking Bracelets in Polynomial Time","authors":"Duncan Adamson, Argyrios Deligkas, V. Gusev, I. Potapov","doi":"10.4230/LIPIcs.CPM.2021.4","DOIUrl":"https://doi.org/10.4230/LIPIcs.CPM.2021.4","url":null,"abstract":"The main result of the paper is the first polynomial-time algorithm for ranking bracelets. The time-complexity of the algorithm is O(k^2 n^4), where k is the size of the alphabet and n is the length of the considered bracelets. The key part of the algorithm is to compute the rank of any word with respect to the set of bracelets by finding three other ranks: the rank over all necklaces, the rank over palindromic necklaces, and the rank over enclosing apalindromic necklaces. The last two concepts are introduced in this paper. These ranks are key components to our algorithm in order to decompose the problem into parts. Additionally, this ranking procedure is used to build a polynomial-time unranking algorithm.","PeriodicalId":236737,"journal":{"name":"Annual Symposium on Combinatorial Pattern Matching","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125305823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sangsoo Park, Sung Gwan Park, Bastien Cazaux, Kunsoo Park, Eric Rivals
{"title":"A Linear Time Algorithm for Constructing Hierarchical Overlap Graphs","authors":"Sangsoo Park, Sung Gwan Park, Bastien Cazaux, Kunsoo Park, Eric Rivals","doi":"10.4230/LIPIcs.CPM.2021.22","DOIUrl":"https://doi.org/10.4230/LIPIcs.CPM.2021.22","url":null,"abstract":"The hierarchical overlap graph (HOG) is a graph that encodes overlaps from a given set P of n strings, as the overlap graph does. A best known algorithm constructs HOG in O(||P|| log n) time and O(||P||) space, where ||P|| is the sum of lengths of the strings in P. In this paper we present a new algorithm to construct HOG in O(||P||) time and space. Hence, the construction time and space of HOG are better than those of the overlap graph, which are O(||P|| + n²).","PeriodicalId":236737,"journal":{"name":"Annual Symposium on Combinatorial Pattern Matching","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127920748","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Christian Komusiewicz, Mateus de Oliveira Oliveira, M. Zehavi
{"title":"Revisiting the Parameterized Complexity of Maximum-Duo Preservation String Mapping","authors":"Christian Komusiewicz, Mateus de Oliveira Oliveira, M. Zehavi","doi":"10.4230/LIPIcs.CPM.2017.11","DOIUrl":"https://doi.org/10.4230/LIPIcs.CPM.2017.11","url":null,"abstract":"In the Maximum-Duo Preservation String Mapping (Max-Duo PSM) problem, the input consists of two related strings A and B of length n and a nonnegative integer k. The objective is to determine whether there exists a mapping m from the set of positions of A to the set of positions of B that maps only to positions with the same character and preserves at least k duos, which are pairs of adjacent positions. We develop a randomized algorithm that solves Max-Duo PSM in 4^k ⋅ n^{O(1)} time, and a deterministic algorithm that solves this problem in 6.855^k ⋅ n^{O(1)} time. The previous best known (deterministic) algorithm for this problem has (8e)^{2k+o(k)} ⋅ n^{O(1)} running time [Beretta et al. (2016) [1], [2]]. We also show that Max-Duo PSM admits a problem kernel of size O(k^3), improving upon the previous best known problem kernel of size O(k^6).","PeriodicalId":236737,"journal":{"name":"Annual Symposium on Combinatorial Pattern Matching","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115421742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Joshua Sobel, Noah Bertram, C. Ding, F. Nargesian, D. Gildea
{"title":"AWLCO: All-Window Length Co-Occurrence","authors":"Joshua Sobel, Noah Bertram, C. Ding, F. Nargesian, D. Gildea","doi":"10.4230/LIPIcs.CPM.2021.24","DOIUrl":"https://doi.org/10.4230/LIPIcs.CPM.2021.24","url":null,"abstract":"Analyzing patterns in a sequence of events has applications in text analysis, computer programming, and genomics research. In this paper, we consider the all-window-length analysis model which analyzes a sequence of events with respect to windows of all lengths. We study the exact co-occurrence counting problem for the all-window-length analysis model. Our first algorithm is an offline algorithm that counts all-window-length co-occurrences by performing multiple passes over a sequence and computing single-window-length co-occurrences. This algorithm has the time complexity $O(n)$ for each window length and thus a total complexity of $O(n^2)$ and the space complexity $O(|I|)$ for a sequence of size n and an itemset of size $|I|$. We propose AWLCO, an online algorithm that computes all-window-length co-occurrences in a single pass with the expected time complexity of $O(n)$ and space complexity of $O( \\sqrt{ n|I| })$. Following this, we generalize our use case to patterns in which we propose an algorithm that computes all-window-length co-occurrence with expected time complexity $O(n|I|)$ and space complexity $O( \\sqrt{n|I|} + e_{max}|I|)$, where $e_{max}$ is the length of the largest pattern.","PeriodicalId":236737,"journal":{"name":"Annual Symposium on Combinatorial Pattern Matching","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124608711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Longest Run Subsequence Problem: Further Complexity Results","authors":"R. Dondi, F. Sikora","doi":"10.4230/LIPIcs.CPM.2021.14","DOIUrl":"https://doi.org/10.4230/LIPIcs.CPM.2021.14","url":null,"abstract":"Longest Run Subsequence is a problem introduced recently in the context of the scaffolding phase of genome assembly (Schrinner et al., WABI 2020). The problem asks for a maximum length subsequence of a given string that contains at most one run for each symbol (a run is a maximum substring of consecutive identical symbols). The problem has been shown to be NP-hard and to be fixed-parameter tractable when the parameter is the size of the alphabet on which the input string is defined. In this paper we further investigate the complexity of the problem and we show that it is fixed-parameter tractable when it is parameterized by the number of runs in a solution, a smaller parameter. Moreover, we investigate the kernelization complexity of Longest Run Subsequence and we prove that it does not admit a polynomial kernel when parameterized by the size of the alphabet or by the number of runs. Finally, we consider the restriction of Longest Run Subsequence when each symbol has at most two occurrences in the input string and we show that it is APX-hard.","PeriodicalId":236737,"journal":{"name":"Annual Symposium on Combinatorial Pattern Matching","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132486939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Takuya Mieno, S. Pissis, L. Stougie, Michelle Sweering
{"title":"String Sanitization Under Edit Distance: Improved and Generalized","authors":"Takuya Mieno, S. Pissis, L. Stougie, Michelle Sweering","doi":"10.4230/LIPIcs.CPM.2021.19","DOIUrl":"https://doi.org/10.4230/LIPIcs.CPM.2021.19","url":null,"abstract":"Let $W$ be a string of length $n$ over an alphabet $\\Sigma$, $k$ be a positive integer, and $\\mathcal{S}$ be a set of length-$k$ substrings of $W$. The ETFS problem asks us to construct a string $X_{\\mathrm{ED}}$ such that: (i) no string of $\\mathcal{S}$ occurs in $X_{\\mathrm{ED}}$; (ii) the order of all other length-$k$ substrings over $\\Sigma$ is the same in $W$ and in $X_{\\mathrm{ED}}$; and (iii) $X_{\\mathrm{ED}}$ has minimal edit distance to $W$. When $W$ represents an individual's data and $\\mathcal{S}$ represents a set of confidential patterns, the ETFS problem asks for transforming $W$ to preserve its privacy and its utility [Bernardini et al., ECML PKDD 2019]. ETFS can be solved in $\\mathcal{O}(n^2k)$ time [Bernardini et al., CPM 2020]. The same paper shows that ETFS cannot be solved in $\\mathcal{O}(n^{2-\\delta})$ time, for any $\\delta>0$, unless the Strong Exponential Time Hypothesis (SETH) is false. Our main results can be summarized as follows: (i) an $\\mathcal{O}(n^2\\log^2k)$-time algorithm to solve ETFS; and (ii) an $\\mathcal{O}(n^2\\log^2n)$-time algorithm to solve AETFS, a generalization of ETFS in which the elements of $\\mathcal{S}$ can have arbitrary lengths. Our algorithms are thus optimal up to polylogarithmic factors, unless SETH fails. Let us also stress that our algorithms work under edit distance with arbitrary weights at no extra cost. As a bonus, we show how to modify some known techniques, which speed up the standard edit distance computation, to be applied to our problems. Beyond string sanitization, our techniques may inspire solutions to other problems related to regular expressions or context-free grammars.","PeriodicalId":236737,"journal":{"name":"Annual Symposium on Combinatorial Pattern Matching","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124765700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
G. Bernardini, Huiping Chen, G. Loukides, N. Pisanti, S. Pissis, L. Stougie, Michelle Sweering
{"title":"String Sanitization Under Edit Distance","authors":"G. Bernardini, Huiping Chen, G. Loukides, N. Pisanti, S. Pissis, L. Stougie, Michelle Sweering","doi":"10.4230/LIPIcs.CPM.2020.7","DOIUrl":"https://doi.org/10.4230/LIPIcs.CPM.2020.7","url":null,"abstract":"Let W be a string of length n over an alphabet Σ, k be a positive integer, and 𝒮 be a set of length-k substrings of W. The ETFS problem asks us to construct a string X_{ED} such that: (i) no string of 𝒮 occurs in X_{ED}; (ii) the order of all other length-k substrings over Σ is the same in W and in X_{ED}; and (iii) X_{ED} has minimal edit distance to W. When W represents an individual’s data and 𝒮 represents a set of confidential substrings, algorithms solving ETFS can be applied for utility-preserving string sanitization [Bernardini et al., ECML PKDD 2019]. Our first result here is an algorithm to solve ETFS in 𝒪(kn²) time, which improves on the state of the art [Bernardini et al., arXiv 2019] by a factor of |Σ|. Our algorithm is based on a non-trivial modification of the classic dynamic programming algorithm for computing the edit distance between two strings. Notably, we also show that ETFS cannot be solved in 𝒪(n^{2-δ}) time, for any δ>0, unless the strong exponential time hypothesis is false. To achieve this, we reduce the edit distance problem, which is known to admit the same conditional lower bound [Bringmann and Kunnemann, FOCS 2015], to ETFS.","PeriodicalId":236737,"journal":{"name":"Annual Symposium on Combinatorial Pattern Matching","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116303767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}