Efficient Regular Expression Matching Based on Positional Inverted Index (Extended Abstract)

2023 IEEE 39th International Conference on Data Engineering (ICDE). Pub date: 2023-04-01. DOI: 10.1109/ICDE55515.2023.00356

Tao Qiu, Xiaochun Yang, Bin Wang, Wei Wang


Abstract

We study the problem of efficient regular expression (regex) matching. Existing algorithms are scanning-based: they typically compile the regex query into an equivalent automaton and use it to verify a document. Although some works propose strategies for jumping quickly to candidate locations in a document where a query result may appear, they still rely on scanning to verify those candidate locations, and they become inefficient when many candidate locations remain to be verified. In this paper, we propose a novel approach that computes all matching positions for a regex query purely from a positional q-gram inverted index. We propose a gram-driven NFA (GNFA) to represent the language of a regex and show that all regex matching locations can be obtained by finding positions of the GNFA's q-grams that satisfy certain positional constraints. We then propose several GNFA-based query plans that answer the query using the positional inverted index. To improve query efficiency, we design an algorithm that builds a tree-based query plan by carefully choosing the order in which positional constraints are checked. Experimental results on real-world datasets show that our method outperforms state-of-the-art methods by up to an order of magnitude in query efficiency.
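To make the idea of a positional q-gram inverted index more concrete, the following is a minimal sketch in Python. It is not the GNFA construction or the tree-based query plan from the paper: it handles only a literal pattern (the simplest regex), and the function names `build_positional_index` and `match_literal`, the default `q=2`, and the "rarest gram first" ordering heuristic are all assumptions made for illustration. It shows how match positions can be derived from position lists and offset constraints alone, without scanning the document at query time.

```python
from collections import defaultdict


def build_positional_index(text, q=2):
    """Map each q-gram to the ascending list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index


def match_literal(index, pattern, q=2):
    """Return all start positions of `pattern`, using only the index.

    Hypothetical helper: covers the pattern with q-grams and keeps a
    start position s only if every gram at offset `off` occurs at
    position s + off (the positional constraint).
    """
    if len(pattern) < q:
        raise ValueError("pattern must contain at least one q-gram")

    # Choose gram offsets that together cover every character of the pattern.
    offsets = list(range(0, len(pattern) - q + 1, q))
    if offsets[-1] != len(pattern) - q:
        offsets.append(len(pattern) - q)

    # Assumed heuristic: check the gram with the shortest position list first,
    # so the candidate set starts small.
    grams = sorted(
        ((off, pattern[off:off + q]) for off in offsets),
        key=lambda og: len(index.get(og[1], ())),
    )

    first_off, first_gram = grams[0]
    candidates = {
        p - first_off for p in index.get(first_gram, ()) if p - first_off >= 0
    }

    # Enforce the remaining positional constraints one gram at a time.
    for off, gram in grams[1:]:
        if not candidates:
            break
        positions = set(index.get(gram, ()))
        candidates = {s for s in candidates if s + off in positions}

    return sorted(candidates)


if __name__ == "__main__":
    idx = build_positional_index("abracadabra")
    print(match_literal(idx, "abra"))  # [0, 7]
```

In this sketch the order in which grams are checked affects only how quickly the candidate set shrinks, not the result; choosing that order well is the kind of decision the abstract attributes to the tree-based query plan.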

