Sliding Window String Indexing in Streams

P. Bille, J. Fischer, I. L. Gørtz, Max Rishøj Pedersen, Tord Stordalen
Annual Symposium on Combinatorial Pattern Matching, published 2023-01-23
DOI: 10.48550/arXiv.2301.09477 (https://doi.org/10.48550/arXiv.2301.09477)
Citations: 1

Abstract

Given a string $S$ over an alphabet $\Sigma$, the 'string indexing problem' is to preprocess $S$ to subsequently support efficient pattern matching queries, i.e., given a pattern string $P$, report all occurrences of $P$ in $S$. In this paper we study the 'streaming sliding window string indexing problem'. Here the string $S$ arrives as a stream, one character at a time, and the goal is to maintain an index of the last $w$ characters, called the 'window', for a specified parameter $w$. At any point in time a pattern matching query for a pattern $P$ may arrive, also streamed one character at a time, and all occurrences of $P$ within the current window must be returned. The streaming sliding window string indexing problem naturally captures scenarios where we want to index the most recent data (i.e., the window) of a stream while supporting efficient pattern matching. Our main result is a simple $O(w)$ space data structure that uses $O(\log w)$ time with high probability to process each character from both the input string $S$ and the pattern string $P$. Reporting the occurrences of $P$ takes additional constant time per reported occurrence. Compared to previous work in similar scenarios, this result is the first to achieve an efficient worst-case time per character from the input stream. We also consider a delayed variant of the problem, where a query may be answered at any point within the next $\delta$ characters that arrive from either stream. We present an $O(w + \delta)$ space data structure for this problem that improves the above time bounds to $O(\log(w/\delta))$. In particular, for a delay of $\delta = \epsilon w$ we obtain an $O(w)$ space data structure with constant time processing per character. The key idea to achieve our result is a novel and simple hierarchical structure of suffix trees of independent interest, inspired by the classic log-structured merge trees.
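To make the problem statement concrete, the following is a minimal brute-force sketch of the query semantics only. It is not the data structure from the paper: it keeps the last $w$ characters in a buffer and rescans them on every query, so it meets none of the stated time bounds. The class name `NaiveSlidingWindowIndex`, its methods, and the convention of reporting 0-based positions in the stream are assumptions made for illustration.

```python
from collections import deque


class NaiveSlidingWindowIndex:
    """Hypothetical brute-force baseline: keeps the last w characters of the
    stream and scans them on every query (not the paper's structure)."""

    def __init__(self, w: int) -> None:
        if w <= 0:
            raise ValueError("window size w must be positive")
        self.w = w
        self.window = deque(maxlen=w)  # the oldest character is dropped automatically
        self.seen = 0                  # total number of characters read from S so far

    def append(self, ch: str) -> None:
        """Process one character of the stream S."""
        self.window.append(ch)
        self.seen += 1

    def query(self, pattern: str) -> list[int]:
        """Return the 0-based stream positions at which pattern starts,
        restricted to occurrences lying entirely inside the current window."""
        text = "".join(self.window)
        offset = self.seen - len(text)  # stream position of the window's first character
        hits = []
        start = text.find(pattern)
        while start != -1:
            hits.append(offset + start)
            start = text.find(pattern, start + 1)  # advance by 1 to allow overlapping matches
        return hits


index = NaiveSlidingWindowIndex(w=5)
for ch in "abababab":
    index.append(ch)
print(index.query("aba"))  # prints [4]
```

After the eight characters are read, the window holds stream positions 3 to 7 (`babab`), so the occurrence of `aba` starting at position 2 is no longer reported. Each query in this baseline costs time proportional to at least the window size, which is the cost the $O(\log w)$-per-character result described above avoids.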