Giulia Bernardini, Chang Liu, Grigorios Loukides, Alberto Marchetti-Spaccamela, Solon P Pissis, Leen Stougie, Michelle Sweering
{"title":"Missing value replacement in strings and applications.","authors":"Giulia Bernardini, Chang Liu, Grigorios Loukides, Alberto Marchetti-Spaccamela, Solon P Pissis, Leen Stougie, Michelle Sweering","doi":"10.1007/s10618-024-01074-3","DOIUrl":null,"url":null,"abstract":"<p><p>Missing values arise routinely in real-world sequential (string) datasets due to: (1) imprecise data measurements; (2) flexible sequence modeling, such as binding profiles of molecular sequences; or (3) the existence of confidential information in a dataset which has been deleted deliberately for privacy protection. In order to analyze such datasets, it is often important to replace each missing value, with one or more <i>valid</i> letters, in an efficient and effective way. Here we formalize this task as a combinatorial optimization problem: the set of constraints includes the <i>context</i> of the missing value (i.e., its vicinity) as well as a finite set of user-defined <i>forbidden</i> patterns, modeling, for instance, implausible or confidential patterns; and the objective function seeks to <i>minimize the number of new letters</i> we introduce. Algorithmically, our problem translates to finding shortest paths in special graphs that contain <i>forbidden edges</i> representing the forbidden patterns. Our work makes the following contributions: (1) we design a linear-time algorithm to solve this problem for strings over constant-sized alphabets; (2) we show how our algorithm can be effortlessly applied to <i>fully</i> sanitize a private string in the presence of a set of fixed-length forbidden patterns [Bernardini et al. 2021a]; (3) we propose a methodology for sanitizing and clustering a collection of private strings that utilizes our algorithm and an effective and efficiently computable distance measure; and (4) we present extensive experimental results showing that our methodology can efficiently sanitize a collection of private strings while preserving clustering quality, outperforming the state of the art and baselines. To arrive at our theoretical results, we employ techniques from formal languages and combinatorial pattern matching.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"39 2","pages":"12"},"PeriodicalIF":2.8000,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11754389/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Data Mining and Knowledge Discovery","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s10618-024-01074-3","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/22 0:00:00","PubModel":"Epub","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
Missing values arise routinely in real-world sequential (string) datasets due to: (1) imprecise data measurements; (2) flexible sequence modeling, such as binding profiles of molecular sequences; or (3) the existence of confidential information in a dataset which has been deleted deliberately for privacy protection. In order to analyze such datasets, it is often important to replace each missing value, with one or more valid letters, in an efficient and effective way. Here we formalize this task as a combinatorial optimization problem: the set of constraints includes the context of the missing value (i.e., its vicinity) as well as a finite set of user-defined forbidden patterns, modeling, for instance, implausible or confidential patterns; and the objective function seeks to minimize the number of new letters we introduce. Algorithmically, our problem translates to finding shortest paths in special graphs that contain forbidden edges representing the forbidden patterns. Our work makes the following contributions: (1) we design a linear-time algorithm to solve this problem for strings over constant-sized alphabets; (2) we show how our algorithm can be effortlessly applied to fully sanitize a private string in the presence of a set of fixed-length forbidden patterns [Bernardini et al. 2021a]; (3) we propose a methodology for sanitizing and clustering a collection of private strings that utilizes our algorithm and an effective and efficiently computable distance measure; and (4) we present extensive experimental results showing that our methodology can efficiently sanitize a collection of private strings while preserving clustering quality, outperforming the state of the art and baselines. To arrive at our theoretical results, we employ techniques from formal languages and combinatorial pattern matching.
在现实世界的序列(字符串)数据集中,由于以下原因经常出现缺失值:(1)不精确的数据测量;(2)灵活的序列建模,如分子序列的结合谱;或者(3)数据集中存在机密信息,为保护隐私而被故意删除。为了分析这样的数据集,通常重要的是用一个或多个有效的字母替换每个缺失的值,以一种高效和有效的方式。在这里,我们将此任务形式化为组合优化问题:约束集包括缺失值的上下文(即其附近)以及用户定义的禁止模式的有限集,建模,例如,不可信或机密模式;目标函数寻求最小化我们引入的新字母的数量。从算法上讲,我们的问题转化为在包含表示禁止模式的禁止边的特殊图中找到最短路径。我们的工作做出了以下贡献:(1)我们设计了一个线性时间算法来解决恒定长度字母上字符串的这个问题;(2)我们展示了我们的算法如何在存在一组固定长度的禁止模式的情况下毫不费力地应用于完全净化私有字符串[Bernardini et al. 2021a];(3)我们提出了一种利用我们的算法和有效且高效的可计算距离度量对私有字符串集合进行消毒和聚类的方法;(4)我们提供了大量的实验结果,表明我们的方法可以有效地清理私有字符串集合,同时保持聚类质量,优于现有的技术和基线。为了得到我们的理论结果,我们使用了形式语言和组合模式匹配的技术。
期刊介绍:
Advances in data gathering, storage, and distribution have created a need for computational tools and techniques to aid in data analysis. Data Mining and Knowledge Discovery in Databases (KDD) is a rapidly growing area of research and application that builds on techniques and theories from many fields, including statistics, databases, pattern recognition and learning, data visualization, uncertainty modelling, data warehousing and OLAP, optimization, and high performance computing.