Privacy-Preserving Record Linkage for Cardinality Counting

Proceedings of the 2023 ACM Asia Conference on Computer and Communications Security. Pub Date: 2023-01-09. DOI: 10.1145/3579856.3590338

Nan Wu, Dinusha Vatsalan, M. Kâafar, Sanat Ramesh


Citations: 0

Abstract

Several applications require counting the number of distinct items in the data, which is known as the cardinality counting problem. Examples include health applications, such as counting rare-disease patients to support adequate awareness and funding, or counting the cases of a new disease for outbreak detection; marketing applications, such as counting the reach of a new product; and cybersecurity applications, such as tracking the number of unique views of social media posts. The data needed for the counting is, however, often personal and sensitive, and needs to be processed using privacy-preserving techniques. Data quality issues across databases, for example typos, errors and variations, pose additional challenges for accurate cardinality estimation. While privacy-preserving cardinality counting has gained much attention in recent times and a few privacy-preserving algorithms have been developed for cardinality estimation, no work has so far been done on privacy-preserving cardinality counting using record linkage techniques with fuzzy matching and provable privacy guarantees. We propose a novel privacy-preserving record linkage algorithm that uses unsupervised clustering techniques to link and count the individuals in multiple datasets without compromising their privacy or identity. In addition, existing Elbow methods, which take the optimal number of clusters as the cardinality, are far from accurate because they do not take into account the purity and completeness of the generated clusters. We therefore propose a novel method to find the optimal number of clusters in unsupervised learning. Our experimental results on real and synthetic datasets are highly promising: the proposed approach achieves a significantly smaller error rate, below 0.1 at a privacy budget ϵ = 1.0, than the state-of-the-art fuzzy matching and clustering method.
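To make the problem concrete, the following is a minimal sketch of cardinality counting with fuzzy matching and clustering. It is not the authors' algorithm and provides no privacy protection: it works on plaintext records, uses the Python standard library's `SequenceMatcher` as a stand-in similarity function, and groups records by simple single-linkage clustering. The function names, the sample records, the `0.8` threshold, and the relative-error definition are all illustrative assumptions.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Approximate string similarity in [0, 1] (illustrative stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def count_distinct(records: list[str], threshold: float = 0.8) -> int:
    """Estimate cardinality as the number of single-linkage clusters.

    Two records are linked when their similarity reaches `threshold`
    (an assumed value); linked records are merged with union-find.
    """
    parent = list(range(len(records)))

    def find(i: int) -> int:
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i], records[j]) >= threshold:
                parent[find(i)] = find(j)

    # Each distinct root corresponds to one estimated individual.
    return len({find(i) for i in range(len(records))})


def relative_error(estimate: int, truth: int) -> float:
    """Assumed error measure: |estimate - truth| / truth."""
    return abs(estimate - truth) / truth


# Hypothetical records: three individuals, with typos and case variations.
records = [
    "john smith", "jon smith", "John Smith",
    "mary jones", "marie jones",
    "peter brown",
]
estimated = count_distinct(records)
print(estimated, relative_error(estimated, truth=3))
```

Exact matching would count six distinct strings here, whereas fuzzy matching merges the spelling variants into fewer clusters. The choice of threshold governs the trade-off between the purity and the completeness of the clusters, which is why selecting the number of clusters matters for the accuracy of the count.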


