A hybrid anonymization pipeline to improve the privacy-utility balance in sensitive datasets for ML purposes

Jenno Verdonck, Kevin De Boeck, M. Willocx, Jorn Lapon, Vincent Naessens

Proceedings of the 18th International Conference on Availability, Reliability and Security, 2023-08-29. DOI: 10.1145/3600160.3600168
The modern world is data-driven. Businesses increasingly make strategic decisions based on customer data, and companies are founded whose sole focus is performing machine-learning-driven data analytics for third parties. External data sources containing sensitive records are often required to build high-quality machine learning models and, hence, to make accurate and meaningful predictions. However, exchanging sensitive datasets is no trivial matter: personal data must be managed according to privacy regulations, and, similarly, the loss of strategic data can negatively impact a company's competitiveness. In both cases, dataset anonymization can overcome these obstacles. This work proposes a hybrid anonymization pipeline that combines masking with (intelligent) sampling to improve the privacy-utility balance of anonymized datasets. The approach is validated via in-depth experiments on a representative machine learning scenario. A quantitative privacy assessment of the proposed hybrid anonymization pipeline relies on two well-known privacy metrics, namely re-identification risk and certainty. Furthermore, this work shows that the utility of the anonymized dataset remains acceptable, and that the overall privacy-utility balance improves when masking is complemented with intelligent sampling. The study further dispels the common misconception that dataset anonymization is detrimental to the quality of machine learning models: the empirical study shows that anonymous datasets generated by the hybrid anonymization pipeline can compete with the original (identifiable) ones when used as input for training a machine learning model.
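To make the two pipeline stages concrete, the sketch below illustrates the general idea on a toy dataset: masking generalizes quasi-identifiers (age bucketed into decades, ZIP codes truncated), sampling then retains a subset of the masked records, and re-identification risk is estimated as the average of 1/|equivalence class| over the quasi-identifier combinations. This is a minimal illustration of the concepts named in the abstract, not the authors' implementation; the generalization rules, the uniform sampler standing in for their "intelligent" sampling, and the specific risk estimator are all assumptions for demonstration purposes.

```python
# Illustrative sketch (NOT the paper's implementation): masking followed by
# sampling, with a simple average re-identification risk estimate.
import random
from collections import Counter

def mask(record):
    """Generalize quasi-identifiers: age to a decade bucket, ZIP to a 3-digit prefix."""
    age, zip_code, diagnosis = record
    decade = (age // 10) * 10
    return (f"{decade}-{decade + 9}", zip_code[:3] + "**", diagnosis)

def sample(records, fraction, seed=0):
    """Uniform random sampling; the paper's 'intelligent' sampling would instead
    select records so as to balance privacy against model utility."""
    rng = random.Random(seed)
    k = max(1, int(len(records) * fraction))
    return rng.sample(records, k)

def reidentification_risk(records):
    """Mean of 1/|equivalence class| over the quasi-identifiers (age, ZIP)."""
    classes = Counter((age, z) for age, z, _ in records)
    return sum(1 / classes[(age, z)] for age, z, _ in records) / len(records)

data = [(34, "90210", "flu"),  (37, "90211", "cold"),
        (52, "10001", "flu"),  (58, "10002", "asthma"),
        (36, "90212", "cold"), (55, "10003", "flu")]

masked = [mask(r) for r in data]
anonymized = sample(masked, 0.5)

print(reidentification_risk(data))    # → 1.0 (every record unique)
print(reidentification_risk(masked))  # → ~0.33 (classes of size 3)
```

Every record in the toy original is unique on (age, ZIP), so its risk is 1.0; after masking, the records collapse into two equivalence classes of size 3, dropping the average risk to 1/3, and sampling further reduces how many sensitive records are released at all.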