Gabriel Bathie, Panagiotis Charalampopoulos, Tatiana Starikovskaya
arXiv - CS - Data Structures and Algorithms, 2024-08-07. DOI: arxiv-2408.03610
Longest Common Extensions with Wildcards: Trade-off and Applications
We study the Longest Common Extension (LCE) problem in a string containing
wildcards. Wildcards (also called "don't cares" or "holes") are special
characters that match any other character in the alphabet, similar to the
character "?" in Unix commands or "." in regular expression engines. We consider the problem parametrized by $G$, the number of maximal contiguous
groups of wildcards in the input string. Our main contribution is a simple data
structure for this problem that can be built in $O(n (G/t) \log n)$ time,
occupies $O(nG/t)$ space, and answers queries in $O(t)$ time, for any
$t \in [1 .. G]$. Up to the $O(\log n)$ factor, this interpolates smoothly between the
data structure of Crochemore et al. [JDA 2015], which has $O(nG)$ preprocessing
time and space, and $O(1)$ query time, and a simple solution based on the
"kangaroo jumping" technique [Landau and Vishkin, STOC 1986], which has
$O(n)$ preprocessing time and space, and $O(G)$ query time. By establishing a connection between this problem and Boolean matrix
multiplication, we show that our solution is optimal up to subpolynomial
factors when $G = \Omega(n)$ under a widely believed hypothesis. In addition,
we develop a new simple, deterministic and combinatorial algorithm for sparse
Boolean matrix multiplication. Finally, we show that our data structure can be used to obtain efficient
algorithms for approximate pattern matching and structural analysis of strings
with wildcards.
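
To make the kangaroo-jumping baseline mentioned above concrete, here is a minimal Python sketch of an LCE query in a string with wildcards. It is an illustration under stated assumptions, not the data structure from the paper: the wildcard symbol `?`, the helper names (`naive_lcew`, `group_ends`, `plain_lce`, `kangaroo_lcew`), and the naive `plain_lce` scan are all choices made for this example. In particular, `plain_lce` stands in for a constant-time LCE structure on the string with wildcards treated as ordinary letters, so the sketch shows only the jump logic and does not reach the $O(n)$ preprocessing and $O(G)$ query bounds by itself.

```python
WILDCARD = "?"  # assumed wildcard symbol, chosen only for this sketch


def naive_lcew(text, i, j):
    """Reference answer: scan position by position, letting wildcards match anything."""
    n, k = len(text), 0
    while i + k < n and j + k < n:
        a, b = text[i + k], text[j + k]
        if a != b and a != WILDCARD and b != WILDCARD:
            break
        k += 1
    return k


def group_ends(text):
    """end[p] = first index after the wildcard group containing p, or p if text[p] is a letter."""
    n = len(text)
    end = list(range(n))
    for p in range(n - 1, -1, -1):
        if text[p] == WILDCARD:
            next_is_wild = p + 1 < n and text[p + 1] == WILDCARD
            end[p] = end[p + 1] if next_is_wild else p + 1
    return end


def plain_lce(text, i, j):
    """Ordinary LCE with wildcards treated as a regular letter.

    Naive stand-in (assumption): a real implementation would answer this in O(1),
    for example with a suffix tree and lowest common ancestor queries.
    """
    n, k = len(text), 0
    while i + k < n and j + k < n and text[i + k] == text[j + k]:
        k += 1
    return k


def kangaroo_lcew(text, i, j, end):
    """LCE with wildcards: alternate plain LCE calls with jumps over wildcard groups."""
    limit = len(text) - max(i, j)  # the extension cannot run past the end of the string
    k = 0
    while k < limit:
        k += plain_lce(text, i + k, j + k)
        if k >= limit:
            break
        a, b = i + k, j + k
        # At a plain mismatch at most one side is a wildcard; skip the rest of its group.
        run = max(end[a] - a, end[b] - b)
        if run == 0:
            break  # two distinct regular letters: a genuine mismatch
        k = min(k + run, limit)
    return k


text = "a?cab?"
end = group_ends(text)
print(kangaroo_lcew(text, 0, 3, end))  # prints 3
```

Each jump skips to the end of a wildcard group on one of the two sides, so a query performs at most about $2G$ jumps and plain LCE calls. That is where the $O(G)$ query time of the baseline comes from once `plain_lce` is answered in constant time.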