Longest Common Extensions with Wildcards: Trade-off and Applications

Gabriel Bathie, Panagiotis Charalampopoulos, Tatiana Starikovskaya
arXiv - CS - Data Structures and Algorithms, published 2024-08-07. DOI: arxiv-2408.03610 (https://doi.org/arxiv-2408.03610)

Abstract

We study the Longest Common Extension (LCE) problem in a string containing wildcards. Wildcards (also called "don't cares" or "holes") are special characters that match any other character in the alphabet, similar to the character "?" in Unix commands or "." in regular expression engines. We consider the problem parametrized by $G$, the number of maximal contiguous groups of wildcards in the input string. Our main contribution is a simple data structure for this problem that can be built in $O(n (G/t) \log n)$ time, occupies $O(nG/t)$ space, and answers queries in $O(t)$ time, for any $t \in [1 .. G]$. Up to the $O(\log n)$ factor, this interpolates smoothly between the data structure of Crochemore et al. [JDA 2015], which has $O(nG)$ preprocessing time and space, and $O(1)$ query time, and a simple solution based on the "kangaroo jumping" technique [Landau and Vishkin, STOC 1986], which has $O(n)$ preprocessing time and space, and $O(G)$ query time. By establishing a connection between this problem and Boolean matrix multiplication, we show that our solution is optimal up to subpolynomial factors when $G = \Omega(n)$ under a widely believed hypothesis. In addition, we develop a new simple, deterministic and combinatorial algorithm for sparse Boolean matrix multiplication. Finally, we show that our data structure can be used to obtain efficient algorithms for approximate pattern matching and structural analysis of strings with wildcards.
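
To make the "kangaroo jumping" baseline mentioned in the abstract concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the data structure from the paper: the wildcard symbol "?" and the function names are hypothetical, and `plain_lce` is a naive character scan standing in for a constant-time exact LCE oracle (for example, a suffix array with range-minimum queries). With such an oracle, each query would make at most about 2G jumps, because every jump skips past the end of a wildcard group at one of the two positions.

```python
WILDCARD = "?"  # assumed wildcard symbol; the paper does not fix one


def wildcard_run_lengths(text):
    """run[i] = number of consecutive wildcards starting at position i."""
    n = len(text)
    run = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        if text[i] == WILDCARD:
            run[i] = run[i + 1] + 1
    return run


def plain_lce(text, i, j):
    """Exact-match LCE: a wildcard only equals another wildcard here.

    Naive scan used as a stand-in for an O(1) LCE oracle.
    """
    n = len(text)
    k = 0
    while i + k < n and j + k < n and text[i + k] == text[j + k]:
        k += 1
    return k


def lce_with_wildcards(text, run, i, j):
    """Longest common extension of text[i:] and text[j:] with wildcards."""
    n = len(text)
    total = 0
    while i + total < n and j + total < n:
        # Jump over the longest exactly matching stretch.
        total += plain_lce(text, i + total, j + total)
        a, b = i + total, j + total
        if a >= n or b >= n:
            break
        # The characters at a and b differ; continue only if a wildcard
        # group starts at one of them, and skip that whole group at once.
        skip = max(run[a], run[b])
        if skip == 0:
            break  # genuine mismatch between two non-wildcard characters
        total += min(skip, n - a, n - b)
    return total


text = "x??yx?zy"
run = wildcard_run_lengths(text)
print(lce_with_wildcards(text, run, 0, 4))
```

In this example the suffixes starting at positions 0 and 4 agree on four positions ("x??y" against "x?zy") before the shorter suffix ends, so the query makes one exact jump, one wildcard-group skip, and one more exact jump.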