Pattern Masking for Dictionary Matching: Theory and Practice

IF 0.9, Zone 4 (Computer Science), JCR Q4 (COMPUTER SCIENCE, SOFTWARE ENGINEERING)

Algorithmica. Pub Date: 2024-03-06. DOI: 10.1007/s00453-024-01213-8

Panagiotis Charalampopoulos, Huiping Chen, Peter Christen, Grigorios Loukides, Nadia Pisanti, Solon P. Pissis, Jakub Radoszewski

Algorithmica, volume 86, issue 6, pages 1948-1978. Journal article. Article page: https://link.springer.com/article/10.1007/s00453-024-01213-8

Citations: 0

Abstract

Data masking is a common technique for sanitizing sensitive data maintained in database systems, and it is becoming increasingly important in application areas such as record linkage of personal data. This work formalizes the Pattern Masking for Dictionary Matching (PMDM) problem: given a dictionary \(\mathscr {D}\) of d strings, each of length \(\ell \), a query string q of length \(\ell \), and a positive integer z, compute a smallest set \(K\subseteq \{1,\ldots ,\ell \}\) such that, if q[i] is replaced by a wildcard for all \(i\in K\), then q matches at least z strings from \(\mathscr {D}\). Unlike existing approaches, solving PMDM provides data utility guarantees. We first show, through a reduction from the well-known k-Clique problem, that a decision version of the PMDM problem is NP-complete, even for binary strings. We thus approach the problem from a more practical perspective. We show a combinatorial \(\mathscr {O}((d\ell )^{|K|/3}+d\ell )\)-time and \(\mathscr {O}(d\ell )\)-space algorithm for PMDM for \(|K|=\mathscr {O}(1)\). In fact, we show that we cannot hope for a faster combinatorial algorithm unless the combinatorial k-Clique hypothesis fails (Abboud et al. in SIAM J Comput 47:2527–2555, 2018; Lincoln et al., in: 29th ACM-SIAM Symposium on Discrete Algorithms (SODA), 2018). Our combinatorial algorithm, executed with small |K|, is the backbone of a greedy heuristic that we propose. Our experiments on real-world and synthetic datasets show that our heuristic finds nearly optimal solutions in practice and is also very efficient. We also generalize this algorithm to the problem of masking multiple query strings simultaneously so that every string has at least z matches in \(\mathscr {D}\).
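To make the problem statement concrete, the following is a minimal brute-force sketch of PMDM in Python. It is not the paper's combinatorial algorithm or its greedy heuristic; it simply tries candidate sets K in order of increasing size. The function names `matches` and `pmdm_brute_force` are hypothetical, and positions are 0-based here, whereas the abstract uses \(\{1,\ldots ,\ell \}\).

```python
from itertools import combinations


def matches(q, s, K):
    """True if s agrees with q at every position outside the masked set K.

    Positions in K (0-based) are treated as wildcards in q.
    """
    return all(i in K or a == b for i, (a, b) in enumerate(zip(q, s)))


def pmdm_brute_force(dictionary, q, z):
    """Return a smallest set K of 0-based positions such that q, masked at K,
    matches at least z strings of the dictionary.

    Returns None when the dictionary holds fewer than z strings.
    Assumes every dictionary string has the same length as q.
    """
    ell = len(q)
    for size in range(ell + 1):  # smallest masks first
        for K in combinations(range(ell), size):
            masked = set(K)
            if sum(matches(q, s, masked) for s in dictionary) >= z:
                return masked
    return None


# Illustrative data only, not taken from the paper.
D = ["aab", "abb", "bbb", "aaa"]
print(pmdm_brute_force(D, "aaa", 2))
print(pmdm_brute_force(D, "aaa", 3))
```

With this toy dictionary, masking only the last position of "aaa" is enough for two matches, while three matches require masking the last two positions.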
PMDM can be viewed as a generalization of the decision version of the dictionary matching with mismatches problem: by querying a PMDM data structure with string q and \(z=1\), one obtains the minimal number of mismatches of q with any string from \(\mathscr {D}\). The query time or space of all known data structures for the more restricted problem of dictionary matching with at most k mismatches incurs some exponential factor with respect to k. A simple exact algorithm for PMDM runs in time \(\mathscr {O}(2^\ell d)\). We present a data structure for PMDM that answers queries over \(\mathscr {D}\) in time \(\mathscr {O}(2^{\ell /2}(2^{\ell /2}+\tau )\ell )\) and requires space \(\mathscr {O}(2^{\ell }d^2/\tau ^2+2^{\ell /2}d)\), for any parameter \(\tau \in [1,d]\). We complement our results by showing a two-way polynomial-time reduction between PMDM and the Minimum Union problem [Chlamtáč et al., ACM-SIAM Symposium on Discrete Algorithms (SODA) 2017]. This gives a polynomial-time \(\mathscr {O}(d^{1/4+\epsilon })\)-approximation algorithm for PMDM, which is tight under a plausible complexity conjecture. This is an extended version of a paper that was presented at the International Symposium on Algorithms and Computation (ISAAC) 2021.
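One plausible reading of the simple \(\mathscr {O}(2^\ell d)\) exact algorithm, and of the \(z=1\) connection to mismatches, is sketched below in Python. The abstract does not spell out the algorithm, so this is an assumption: each dictionary string is reduced to a bitmask of the positions where it differs from q, and every one of the \(2^\ell \) candidate masks is tested against all d strings. The names `mismatch_mask`, `pmdm_exact`, and `min_mismatches` are hypothetical, and the running-time claim assumes \(\ell \) fits in a machine word.

```python
def mismatch_mask(q, s):
    """Bitmask with bit i set iff q and s differ at 0-based position i."""
    return sum(1 << i for i, (a, b) in enumerate(zip(q, s)) if a != b)


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def pmdm_exact(dictionary, q, z):
    """Try all 2^ell masks K and return a smallest one (as a bitmask) under
    which q matches at least z dictionary strings, or None if impossible.

    A string matches the masked q iff its mismatch mask is a subset of K.
    """
    ell = len(q)
    masks = [mismatch_mask(q, s) for s in dictionary]
    best = None
    for K in range(1 << ell):
        # Skip masks that cannot improve on the best one found so far.
        if best is not None and popcount(K) >= popcount(best):
            continue
        if sum(1 for m in masks if m & ~K == 0) >= z:
            best = K
    return best


def min_mismatches(dictionary, q):
    """With z = 1, the size of the optimal mask equals the minimum Hamming
    distance between q and any dictionary string (dictionary must be non-empty)."""
    return popcount(pmdm_exact(dictionary, q, 1))


# Illustrative data only, not taken from the paper.
D = ["aab", "abb", "bbb", "aaa"]
print(bin(pmdm_exact(D, "aaa", 3)))
print(min_mismatches(["abb", "bbb"], "aaa"))
```

The subset test `m & ~K == 0` is what makes each of the d checks per mask a constant number of word operations in this sketch.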




Source journal

Algorithmica (Engineering and Technology, Computer Science: Software Engineering)

CiteScore

2.80

Self-citation rate

9.10%

Articles published

158

Review time

12 months

Journal description: Algorithmica is an international journal which publishes theoretical papers on algorithms that address problems arising in practical areas, and experimental papers of general appeal for practical importance or techniques. The development of algorithms is an integral part of computer science. The increasing complexity and scope of computer applications make the design of efficient algorithms essential. Algorithmica covers algorithms in applied areas such as VLSI, distributed computing, parallel processing, automated design, robotics, graphics, database design, and software tools, as well as algorithms in fundamental areas such as sorting, searching, data structures, computational geometry, and linear programming. In addition, the journal features two special sections: Application Experience, presenting findings obtained from applications of theoretical results to practical situations, and Problems, offering short papers presenting problems on selected topics of computer science.