String Sanitization Under Edit Distance: Improved and Generalized

Takuya Mieno, S. Pissis, L. Stougie, Michelle Sweering
Annual Symposium on Combinatorial Pattern Matching. DOI: 10.4230/LIPIcs.CPM.2021.19 (https://doi.org/10.4230/LIPIcs.CPM.2021.19). Publication date: 2020-07-16.
Citations: 1

Abstract

Let $W$ be a string of length $n$ over an alphabet $\Sigma$, let $k$ be a positive integer, and let $\mathcal{S}$ be a set of length-$k$ substrings of $W$. The ETFS problem asks us to construct a string $X_{\mathrm{ED}}$ such that: (i) no string of $\mathcal{S}$ occurs in $X_{\mathrm{ED}}$; (ii) the order of all other length-$k$ substrings over $\Sigma$ is the same in $W$ and in $X_{\mathrm{ED}}$; and (iii) $X_{\mathrm{ED}}$ has minimal edit distance to $W$. When $W$ represents an individual's data and $\mathcal{S}$ represents a set of confidential patterns, the ETFS problem asks for a transformation of $W$ that preserves both its privacy and its utility [Bernardini et al., ECML PKDD 2019].

ETFS can be solved in $\mathcal{O}(n^2k)$ time [Bernardini et al., CPM 2020]. The same paper shows that ETFS cannot be solved in $\mathcal{O}(n^{2-\delta})$ time, for any $\delta>0$, unless the Strong Exponential Time Hypothesis (SETH) is false. Our main results can be summarized as follows: (i) an $\mathcal{O}(n^2\log^2k)$-time algorithm to solve ETFS; and (ii) an $\mathcal{O}(n^2\log^2n)$-time algorithm to solve AETFS, a generalization of ETFS in which the elements of $\mathcal{S}$ can have arbitrary lengths. Our algorithms are thus optimal up to polylogarithmic factors, unless SETH fails. We also stress that our algorithms work under edit distance with arbitrary weights at no extra cost. As a bonus, we show how to modify some known techniques, which speed up the standard edit distance computation, so that they can be applied to our problems. Beyond string sanitization, our techniques may inspire solutions to other problems related to regular expressions or context-free grammars.
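To make the three ETFS conditions concrete, the following is a minimal sketch of a checker for a candidate string, not the algorithm of the paper. It rests on assumptions that the abstract does not spell out: the candidate may contain a separator symbol `#` outside $\Sigma$, length-$k$ windows containing `#` are ignored, and condition (ii) is read as "the left-to-right sequence of non-sensitive length-$k$ substrings is identical in $W$ and in the candidate". The function names and the cost parameters `ins`, `dele`, `sub` are hypothetical, and the distance routine is the textbook quadratic dynamic program with constant per-operation weights.

```python
from typing import List, Set

SEP = "#"  # assumed separator symbol, taken to lie outside the alphabet Sigma


def kmers_over_sigma(text: str, k: int) -> List[str]:
    """Length-k windows of text, left to right, skipping windows that contain SEP."""
    windows = (text[i:i + k] for i in range(len(text) - k + 1))
    return [w for w in windows if SEP not in w]


def is_feasible(w: str, x: str, sensitive: Set[str], k: int) -> bool:
    """Check conditions (i) and (ii) for a candidate x (assumed reading, see lead-in)."""
    x_kmers = kmers_over_sigma(x, k)
    # (i) no sensitive pattern may occur in x
    if any(m in sensitive for m in x_kmers):
        return False
    # (ii) the sequence of non-sensitive k-mers must be the same in w and in x
    w_kept = [m for m in kmers_over_sigma(w, k) if m not in sensitive]
    return w_kept == x_kmers


def edit_distance(a: str, b: str, ins: float = 1.0,
                  dele: float = 1.0, sub: float = 1.0) -> float:
    """Textbook O(|a| * |b|) weighted edit distance, the quantity minimized in (iii)."""
    prev = [j * ins for j in range(len(b) + 1)]  # cost of building b[:j] from ""
    for i in range(1, len(a) + 1):
        cur = [i * dele]  # cost of deleting a[:i] entirely
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (0.0 if a[i - 1] == b[j - 1] else sub)
            cur.append(min(diag, prev[j] + dele, cur[j - 1] + ins))
        prev = cur
    return prev[-1]
```

On the toy instance `w = "abcab"`, `k = 2`, `sensitive = {"ca"}`, the string `w` itself is infeasible because it contains `ca`, while `"abc#ab"` passes both checks and lies at unit-cost distance 1 from `w`, so it is optimal for this instance under the stated reading.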