Longest Common Extensions with Wildcards: Trade-off and Applications

arXiv - CS - Data Structures and Algorithms. Pub Date: 2024-08-07. DOI: arxiv-2408.03610

Gabriel Bathie, Panagiotis Charalampopoulos, Tatiana Starikovskaya



Abstract


We study the Longest Common Extension (LCE) problem in a string containing wildcards. Wildcards (also called "don't cares" or "holes") are special characters that match any other character in the alphabet, similar to the character "?" in Unix commands or "." in regular expression engines.

We consider the problem parametrized by $G$, the number of maximal contiguous groups of wildcards in the input string. Our main contribution is a simple data structure for this problem that can be built in $O(n (G/t) \log n)$ time, occupies $O(nG/t)$ space, and answers queries in $O(t)$ time, for any $t \in [1 .. G]$. Up to the $O(\log n)$ factor, this interpolates smoothly between the data structure of Crochemore et al. [JDA 2015], which has $O(nG)$ preprocessing time and space, and $O(1)$ query time, and a simple solution based on the "kangaroo jumping" technique [Landau and Vishkin, STOC 1986], which has $O(n)$ preprocessing time and space, and $O(G)$ query time.

By establishing a connection between this problem and Boolean matrix multiplication, we show that our solution is optimal up to subpolynomial factors when $G = \Omega(n)$ under a widely believed hypothesis. In addition, we develop a new simple, deterministic and combinatorial algorithm for sparse Boolean matrix multiplication.

Finally, we show that our data structure can be used to obtain efficient algorithms for approximate pattern matching and structural analysis of strings with wildcards.
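To make the problem statement concrete, the following is a minimal Python sketch of the two quantities the abstract refers to: the parameter $G$ and an LCE query under wildcard matching. It is a naive baseline that scans character by character, not the data structure from the paper, the kangaroo-jumping solution, or the structure of Crochemore et al. The wildcard symbol `?` and the names `count_wildcard_groups`, `chars_match`, and `lce_naive` are assumptions chosen for illustration.

```python
WILDCARD = "?"  # assumed wildcard symbol; the abstract does not fix one


def count_wildcard_groups(s: str) -> int:
    """Return G, the number of maximal contiguous runs of wildcards in s."""
    groups = 0
    prev_is_wild = False
    for ch in s:
        is_wild = ch == WILDCARD
        # A new group starts whenever a wildcard follows a non-wildcard
        # (or is the first character of the string).
        if is_wild and not prev_is_wild:
            groups += 1
        prev_is_wild = is_wild
    return groups


def chars_match(a: str, b: str) -> bool:
    """Two characters match if they are equal or at least one is a wildcard."""
    return a == WILDCARD or b == WILDCARD or a == b


def lce_naive(s: str, i: int, j: int) -> int:
    """Length of the longest common extension of the suffixes s[i:] and s[j:],
    comparing position by position under wildcard matching.

    Runs in time linear in the answer; shown only to pin down the definition.
    """
    n = len(s)
    length = 0
    while (
        i + length < n
        and j + length < n
        and chars_match(s[i + length], s[j + length])
    ):
        length += 1
    return length


if __name__ == "__main__":
    text = "ab?ab??b"
    print(count_wildcard_groups(text))  # number of wildcard groups G
    print(lce_naive(text, 0, 3))        # LCE of the suffixes starting at 0 and 3
```

A scan like this answers a query in time proportional to the answer, which can be as large as $n$; the trade-off described in the abstract concerns structures that bound the query time by $t$ instead, at the cost of $O(nG/t)$ space.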
