A hybrid anonymization pipeline to improve the privacy-utility balance in sensitive datasets for ML purposes

Jenno Verdonck, Kevin De Boeck, M. Willocx, Jorn Lapon, Vincent Naessens

Proceedings of the 18th International Conference on Availability, Reliability and Security, 2023-08-29. DOI: 10.1145/3600160.3600168
The modern world is data-driven. Businesses increasingly make strategic decisions based on customer data, and companies are founded whose sole focus is performing machine-learning-driven data analytics for third parties. External data sources containing sensitive records are often required to build high-quality machine learning models and, hence, to make accurate and meaningful predictions. However, exchanging sensitive datasets is no trivial matter: personal data must be managed according to privacy regulations, and, similarly, the loss of strategic data can negatively impact a company's competitiveness. In both cases, dataset anonymization can overcome these obstacles. This work proposes a hybrid anonymization pipeline that combines masking with (intelligent) sampling to improve the privacy-utility balance of anonymized datasets. The approach is validated via in-depth experiments on a representative machine learning scenario. A quantitative privacy assessment of the proposed hybrid anonymization pipeline relies on two well-known privacy metrics, namely re-identification risk and certainty. Furthermore, this work shows that the utility of the anonymized dataset remains acceptable, and that the overall privacy-utility balance improves when masking is complemented with intelligent sampling. The study further dispels the common misconception that dataset anonymization is detrimental to the quality of machine learning models: the empirical study shows that anonymous datasets generated by the hybrid anonymization pipeline can compete with the original (identifiable) ones when used as input for training a machine learning model.
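To make the two pipeline stages concrete, the sketch below illustrates the general idea on a toy dataset: masking generalizes quasi-identifiers (age bucketed into decades, ZIP codes truncated), sampling then retains a subset of the masked records, and re-identification risk is estimated as the average of 1/|equivalence class| over the quasi-identifier combinations. This is a minimal illustration of the concepts named in the abstract, not the authors' implementation; the generalization rules, the uniform sampler standing in for their "intelligent" sampling, and the specific risk estimator are all assumptions for demonstration purposes.

```python
# Illustrative sketch (NOT the paper's implementation): masking followed by
# sampling, with a simple average re-identification risk estimate.
import random
from collections import Counter

def mask(record):
    """Generalize quasi-identifiers: age to a decade bucket, ZIP to a 3-digit prefix."""
    age, zip_code, diagnosis = record
    decade = (age // 10) * 10
    return (f"{decade}-{decade + 9}", zip_code[:3] + "**", diagnosis)

def sample(records, fraction, seed=0):
    """Uniform random sampling; the paper's 'intelligent' sampling would instead
    select records so as to balance privacy against model utility."""
    rng = random.Random(seed)
    k = max(1, int(len(records) * fraction))
    return rng.sample(records, k)

def reidentification_risk(records):
    """Mean of 1/|equivalence class| over the quasi-identifiers (age, ZIP)."""
    classes = Counter((age, z) for age, z, _ in records)
    return sum(1 / classes[(age, z)] for age, z, _ in records) / len(records)

data = [(34, "90210", "flu"),  (37, "90211", "cold"),
        (52, "10001", "flu"),  (58, "10002", "asthma"),
        (36, "90212", "cold"), (55, "10003", "flu")]

masked = [mask(r) for r in data]
anonymized = sample(masked, 0.5)

print(reidentification_risk(data))    # → 1.0 (every record unique)
print(reidentification_risk(masked))  # → ~0.33 (classes of size 3)
```

Every record in the toy original is unique on (age, ZIP), so its risk is 1.0; after masking, the records collapse into two equivalence classes of size 3, dropping the average risk to 1/3, and sampling further reduces how many sensitive records are released at all.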