String Sanitization Under Edit Distance: Improved and Generalized

Annual Symposium on Combinatorial Pattern Matching | Pub Date: 2020-07-16 | DOI: 10.4230/LIPIcs.CPM.2021.19

Takuya Mieno, S. Pissis, L. Stougie, Michelle Sweering


Citations: 1

Abstract

Let $W$ be a string of length $n$ over an alphabet $\Sigma$, $k$ be a positive integer, and $\mathcal{S}$ be a set of length-$k$ substrings of $W$. The ETFS problem asks us to construct a string $X_{\mathrm{ED}}$ such that: (i) no string of $\mathcal{S}$ occurs in $X_{\mathrm{ED}}$; (ii) the order of all other length-$k$ substrings over $\Sigma$ is the same in $W$ and in $X_{\mathrm{ED}}$; and (iii) $X_{\mathrm{ED}}$ has minimal edit distance to $W$. When $W$ represents an individual's data and $\mathcal{S}$ represents a set of confidential patterns, the ETFS problem asks for transforming $W$ to preserve its privacy and its utility [Bernardini et al., ECML PKDD 2019]. ETFS can be solved in $\mathcal{O}(n^2k)$ time [Bernardini et al., CPM 2020]. The same paper shows that ETFS cannot be solved in $\mathcal{O}(n^{2-\delta})$ time, for any $\delta>0$, unless the Strong Exponential Time Hypothesis (SETH) is false. Our main results can be summarized as follows: (i) an $\mathcal{O}(n^2\log^2k)$-time algorithm to solve ETFS; and (ii) an $\mathcal{O}(n^2\log^2n)$-time algorithm to solve AETFS, a generalization of ETFS in which the elements of $\mathcal{S}$ can have arbitrary lengths. Our algorithms are thus optimal up to polylogarithmic factors, unless SETH fails. Let us also stress that our algorithms work under edit distance with arbitrary weights at no extra cost. As a bonus, we show how to modify some known techniques, which speed up the standard edit distance computation, to be applied to our problems. Beyond string sanitization, our techniques may inspire solutions to other problems related to regular expressions or context-free grammars.
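To make the three conditions concrete, here is a minimal Python sketch. It is not the paper's algorithm: it does not construct an optimal $X_{\mathrm{ED}}$ and it runs the textbook quadratic dynamic program. It only checks conditions (i) and (ii) for a candidate string and computes the weighted edit distance that condition (iii) minimizes. The function names, the per-letter cost callables, and the separator letter `#` (assumed to lie outside $\Sigma$, so that length-$k$ substrings containing it are not "over $\Sigma$") are assumptions made for illustration.

```python
def k_mers(s, k):
    """All length-k substrings of s, in left-to-right order."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def is_feasible(w, x, k, sensitive, sep="#"):
    """Check conditions (i) and (ii) for a candidate string x.

    `sep` is an assumed separator letter outside the alphabet.
    """
    x_kmers = k_mers(x, k)
    # (i) no sensitive pattern occurs in x
    if any(p in sensitive for p in x_kmers):
        return False
    # (ii) the non-sensitive length-k substrings appear in the same order
    kept_w = [p for p in k_mers(w, k) if p not in sensitive]
    kept_x = [p for p in x_kmers if sep not in p]
    return kept_w == kept_x


def weighted_edit_distance(
    w,
    x,
    ins=lambda c: 1,
    dele=lambda c: 1,
    sub=lambda a, b: 0 if a == b else 1,
):
    """Standard O(|w| * |x|) dynamic program with per-letter costs."""
    n = len(x)
    prev = [0] * (n + 1)
    for j in range(1, n + 1):
        prev[j] = prev[j - 1] + ins(x[j - 1])
    for i in range(1, len(w) + 1):
        cur = [prev[0] + dele(w[i - 1])] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(
                prev[j] + dele(w[i - 1]),             # delete w[i-1]
                cur[j - 1] + ins(x[j - 1]),           # insert x[j-1]
                prev[j - 1] + sub(w[i - 1], x[j - 1]),  # match or substitute
            )
        prev = cur
    return prev[n]


if __name__ == "__main__":
    W, k, S = "abcab", 2, {"ca"}
    X = "abc#ab"  # hypothetical candidate, not claimed to be optimal
    print(is_feasible(W, X, k, S))
    print(weighted_edit_distance(W, X))
```

In this toy instance the candidate `abc#ab` keeps the sequence `ab, bc, ab` of non-sensitive length-2 substrings while hiding `ca`. Passing different cost callables (for example `ins=lambda c: 2`) illustrates what "edit distance with arbitrary weights" means at the level of the cost model only; the paper's contribution is computing an optimal string within the stated time bounds.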

