Sliding Window String Indexing in Streams

Annual Symposium on Combinatorial Pattern Matching. Pub Date: 2023-01-23. DOI: 10.48550/arXiv.2301.09477

P. Bille, J. Fischer, I. L. Gørtz, Max Rishøj Pedersen, Tord Stordalen


Citations: 1

Abstract

Given a string $S$ over an alphabet $\Sigma$, the 'string indexing problem' is to preprocess $S$ so as to support efficient pattern matching queries: given a pattern string $P$, report all occurrences of $P$ in $S$. In this paper we study the 'streaming sliding window string indexing problem'. Here the string $S$ arrives as a stream, one character at a time, and the goal is to maintain an index of the last $w$ characters, called the 'window', for a specified parameter $w$. At any point in time a pattern matching query for a pattern $P$ may arrive, also streamed one character at a time, and all occurrences of $P$ within the current window must be returned. This problem naturally captures scenarios where we want to index the most recent data of a stream (i.e., the window) while supporting efficient pattern matching. Our main result is a simple $O(w)$ space data structure that uses $O(\log w)$ time with high probability to process each character from both the input string $S$ and the pattern string $P$. Reporting the occurrences of $P$ uses additional constant time per reported occurrence. Compared to previous work in similar scenarios, this result is the first to achieve an efficient worst-case time per character from the input stream. We also consider a delayed variant of the problem, where a query may be answered at any point within the next $\delta$ characters that arrive from either stream. We present an $O(w + \delta)$ space data structure for this problem that improves the above time bounds to $O(\log(w/\delta))$. In particular, for a delay of $\delta = \epsilon w$ we have $\log(w/\delta) = \log(1/\epsilon)$, so for constant $\epsilon$ we obtain an $O(w)$ space data structure with constant time processing per character. The key idea behind our result is a novel and simple hierarchical structure of suffix trees, of independent interest, inspired by the classic log-structured merge trees.
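To make the problem statement concrete, the following is a minimal brute-force sketch of the interface described in the abstract: characters of $S$ are appended one at a time, and a query reports every occurrence of $P$ inside the last $w$ characters. It is not the data structure from the paper. It rescans the whole window on every query instead of spending $O(\log w)$ time per character, and it takes the pattern as a whole string rather than as a stream. The class name `NaiveSlidingWindowIndex` and its methods are hypothetical names chosen for illustration.

```python
from collections import deque


class NaiveSlidingWindowIndex:
    """Brute-force baseline (an illustration only, not the paper's structure)."""

    def __init__(self, w: int) -> None:
        if w <= 0:
            raise ValueError("window size w must be positive")
        self.w = w
        self.window = deque(maxlen=w)  # holds the last w characters of S
        self.total = 0                 # number of characters of S seen so far

    def append(self, ch: str) -> None:
        """Consume one character of the stream S."""
        self.window.append(ch)  # a full deque drops its oldest character
        self.total += 1

    def occurrences(self, pattern: str) -> list[int]:
        """Return 0-based start positions in S of pattern within the window."""
        if not pattern:
            return []
        text = "".join(self.window)
        offset = self.total - len(text)  # stream position of the window start
        hits = []
        start = text.find(pattern)
        while start != -1:
            hits.append(offset + start)
            start = text.find(pattern, start + 1)  # permits overlapping matches
        return hits


idx = NaiveSlidingWindowIndex(w=5)
for ch in "abracadabra":
    idx.append(ch)

# The window now holds "dabra", i.e. stream positions 6..10.
print(idx.occurrences("a"))     # [7, 10]
print(idx.occurrences("abra"))  # [7]
print(idx.occurrences("cad"))   # [] because "cad" has left the window
```

A baseline like this is mainly useful as a correctness reference when testing a faster index: both should report the same positions for any stream and pattern.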


