Faster FFT-based Wildcard Pattern Matching
Mihail Stoian
Companion of the 2023 International Conference on Management of Data, June 4, 2023. DOI: 10.1145/3555041.3589391
Citations: 0
Abstract
We study the problem of pattern matching with wildcards, which naturally occurs in the SQL LIKE expression. It consists of finding the occurrences of a pattern P, |P| = m, in a text T, |T| = n, where the pattern may contain wildcards, i.e., special characters that match any letter of the alphabet. The naive algorithm for this problem runs in O(nm) time, since checking whether a match is possible at each of the n positions of T takes O(m). To speed this up, several algorithms have been proposed, the simplest being a deterministic FFT-based algorithm in which pattern matching is cast in algebraic form, i.e., P matches T at a position iff (P - T)^2 = 0 there. This naturally leads to an O(n log n) algorithm via FFT, as we can expand the square and search for zero-valued coefficients. Clifford et al. introduce a trick to achieve O(n log m): instead of matching the entire text against the pattern, the text is divided into n / m overlapping slices of length 2m, each of which is matched against the pattern in O(m log m). The total time complexity is then O((n / m) m log m) = O(n log m). We note that other works, especially in pattern matching with errors, rely on this trick. However, the O-notation hides a factor of 4 in this case, assuming m = 2^k. This is because FFT-based matching between strings of length m and 2m actually requires 4m log 4m steps, since the result is of size 3m - 1 and the FFT length must be a power of two. We argue that this trick incurs redundancy, and show how it can be eliminated to obtain an O(n log m) algorithm that is twice as fast, without compromise. Furthermore, we show experimentally that the proposed algorithm approaches the theoretical improvement.
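To make the algebraic formulation concrete, here is a minimal NumPy sketch (names and encoding are illustrative, not the paper's implementation). Wildcards are encoded as 0, so alignment i matches iff the sum over j of p_j t_{i+j} (p_j - t_{i+j})^2 is zero; expanding the square gives three cross-correlations, each computable via FFT. A second function sketches the slicing trick attributed to Clifford et al., matching overlapping slices of length 2m:

```python
import numpy as np

def wildcard_match(text, pattern, wildcard='?'):
    """Find all alignments of `pattern` in `text`; `wildcard` matches anything."""
    # Encode wildcards as 0 and letters as nonzero integers, so a wildcard
    # zeroes out its term in the correlation sum below.
    enc = lambda s: np.array([0 if c == wildcard else ord(c) - ord('a') + 1
                              for c in s], dtype=np.float64)
    t, p = enc(text), enc(pattern)
    n, m = len(t), len(p)
    size = 1                      # FFT length: next power of two >= n + m - 1
    while size < n + m:
        size *= 2
    pr = p[::-1]                  # reversing turns convolution into correlation

    def corr(a, b):
        # zero-padded convolution of a and b via the real FFT
        return np.fft.irfft(np.fft.rfft(a, size) * np.fft.rfft(b, size), size)

    # S[i] = sum_j (p_j^3 t_{i+j} - 2 p_j^2 t_{i+j}^2 + p_j t_{i+j}^3),
    # the expansion of sum_j p_j t_{i+j} (p_j - t_{i+j})^2; zero iff match.
    S = corr(pr**3, t) - 2 * corr(pr**2, t**2) + corr(pr, t**3)
    # alignment i of the pattern lands at index i + m - 1 of the convolution
    return [i for i in range(n - m + 1) if abs(S[i + m - 1]) < 1e-6]

def sliced_match(text, pattern, wildcard='?'):
    """Slicing trick: match overlapping 2m-slices, one per block of m positions."""
    m, hits = len(pattern), []
    for start in range(0, len(text) - m + 1, m):
        chunk = text[start:start + 2 * m]       # slice of length (at most) 2m
        # each slice is responsible for the m alignments starting inside it
        hits += [start + i for i in wildcard_match(chunk, pattern, wildcard)
                 if i < m]
    return hits
```

For example, `wildcard_match("abcab", "a?c")` returns `[0]`. In `sliced_match`, each slice match produces a result of size 3m - 1, forcing the FFT length up to 4m; this is the hidden factor-4 overhead the abstract refers to.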