Faster FFT-based Wildcard Pattern Matching
Mihail Stoian
Companion of the 2023 International Conference on Management of Data, June 4, 2023. DOI: 10.1145/3555041.3589391
Citations: 0
Abstract
We study the problem of pattern matching with wildcards, which naturally occurs in the SQL LIKE expression. It consists of finding the occurrences of a pattern P, |P| = m, in a text T, |T| = n, where the pattern may contain wildcards, i.e., special characters that match any letter of the alphabet. The naive algorithm for this problem runs in O(nm) time, since checking whether a match is possible at each of the n positions of T takes O(m). To speed this up, several algorithms have been proposed, the simplest being a deterministic FFT-based algorithm in which pattern matching is cast in algebraic form, i.e., P matches T at a position iff (P - T)^2 = 0 there. This naturally leads to an O(n log n) algorithm via FFT, as we can expand the square and search for zero-valued coefficients. Clifford et al. introduce a trick to achieve O(n log m): instead of matching the entire text against the pattern, the text is divided into n / m overlapping slices of length 2m, each of which is matched against the pattern in O(m log m). The total time complexity is then O((n / m) m log m) = O(n log m). We note that other works, especially in pattern matching with errors, rely on this trick. However, the O-notation hides a factor of 4 in this case, assuming m = 2^k. This is because FFT-based matching between strings of length m and 2m actually requires 4m log 4m steps, since the result is of size 3m - 1 and the FFT length must be a power of two. We argue that this trick incurs redundancy, and show how it can be eliminated to obtain an O(n log m) algorithm that is twice as fast, without compromise. Furthermore, we show experimentally that the proposed algorithm approaches the theoretical improvement.
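To make the algebraic formulation concrete, here is a minimal NumPy sketch (names and encoding are illustrative, not the paper's implementation). Wildcards are encoded as 0, so alignment i matches iff the sum over j of p_j t_{i+j} (p_j - t_{i+j})^2 is zero; expanding the square gives three cross-correlations, each computable via FFT. A second function sketches the slicing trick attributed to Clifford et al., matching overlapping slices of length 2m:

```python
import numpy as np

def wildcard_match(text, pattern, wildcard='?'):
    """Find all alignments of `pattern` in `text`; `wildcard` matches anything."""
    # Encode wildcards as 0 and letters as nonzero integers, so a wildcard
    # zeroes out its term in the correlation sum below.
    enc = lambda s: np.array([0 if c == wildcard else ord(c) - ord('a') + 1
                              for c in s], dtype=np.float64)
    t, p = enc(text), enc(pattern)
    n, m = len(t), len(p)
    size = 1                      # FFT length: next power of two >= n + m - 1
    while size < n + m:
        size *= 2
    pr = p[::-1]                  # reversing turns convolution into correlation

    def corr(a, b):
        # zero-padded convolution of a and b via the real FFT
        return np.fft.irfft(np.fft.rfft(a, size) * np.fft.rfft(b, size), size)

    # S[i] = sum_j (p_j^3 t_{i+j} - 2 p_j^2 t_{i+j}^2 + p_j t_{i+j}^3),
    # the expansion of sum_j p_j t_{i+j} (p_j - t_{i+j})^2; zero iff match.
    S = corr(pr**3, t) - 2 * corr(pr**2, t**2) + corr(pr, t**3)
    # alignment i of the pattern lands at index i + m - 1 of the convolution
    return [i for i in range(n - m + 1) if abs(S[i + m - 1]) < 1e-6]

def sliced_match(text, pattern, wildcard='?'):
    """Slicing trick: match overlapping 2m-slices, one per block of m positions."""
    m, hits = len(pattern), []
    for start in range(0, len(text) - m + 1, m):
        chunk = text[start:start + 2 * m]       # slice of length (at most) 2m
        # each slice is responsible for the m alignments starting inside it
        hits += [start + i for i in wildcard_match(chunk, pattern, wildcard)
                 if i < m]
    return hits
```

For example, `wildcard_match("abcab", "a?c")` returns `[0]`. In `sliced_match`, each slice match produces a result of size 3m - 1, forcing the FFT length up to 4m; this is the hidden factor-4 overhead the abstract refers to.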