Enhancing protection in high-dimensional data: Distributed differential privacy with feature selection
Authors: I Made Putrama, Péter Martinek
Journal: Information Processing & Management, 61(6), Article 103870 (impact factor 7.4; JCR Q1, Computer Science, Information Systems)
Published: 2024-08-27
DOI: 10.1016/j.ipm.2024.103870
URL: https://www.sciencedirect.com/science/article/pii/S0306457324002292
The computational cost of data privacy protection tends to rise with the number of dimensions, especially on correlated datasets. A faster protection mechanism is therefore needed to handle high-dimensional data while balancing utility and privacy. This study introduces a framework that improves performance by leveraging distributed computing strategies. The framework integrates specific feature selection algorithms with distributed mutual information computation, which is crucial for sensitivity assessment. It is tuned with a hyperparameter search based on Bayesian optimization, which minimizes either a combined score of the Bayesian Information Criterion (BIC) and Akaike's Information Criterion (AIC) or the Maximal Information Coefficient (MIC) score alone. Extensive testing was conducted on 12 datasets with tens to thousands of features, covering classification and regression tasks. With our method, the sensitivity of the resulting data is lower than with alternative approaches, so less perturbation is required for an equivalent level of privacy. Using a novel Privacy Deviation Coefficient (PDC) metric, we assess the performance disparity between original and perturbed data. Overall, execution time for the computation improves significantly, by 64.30%, providing valuable insights for practical applications.
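The abstract's claim that lower sensitivity means less perturbation at the same privacy level can be illustrated with a minimal sketch. It assumes the standard Laplace mechanism, where the noise scale is sensitivity divided by epsilon; the abstract does not say which mechanism the paper uses, and the function names and numbers below are hypothetical.

```python
import random


def laplace_scale(sensitivity: float, epsilon: float) -> float:
    """Noise scale b = sensitivity / epsilon of the standard Laplace mechanism."""
    if sensitivity < 0 or epsilon <= 0:
        raise ValueError("sensitivity must be >= 0 and epsilon must be > 0")
    return sensitivity / epsilon


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one Laplace(0, scale) sample.

    The difference of two independent Exp(1) draws is Laplace(0, 1).
    """
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def perturb(values, sensitivity: float, epsilon: float, seed: int = 0):
    """Add Laplace noise to each value (illustrative only, not the paper's method)."""
    rng = random.Random(seed)
    scale = laplace_scale(sensitivity, epsilon)
    return [v + laplace_noise(scale, rng) for v in values]


if __name__ == "__main__":
    epsilon = 0.5  # same privacy budget in both runs (hypothetical value)
    data = [0.0] * 5
    high = perturb(data, sensitivity=1.0, epsilon=epsilon)
    low = perturb(data, sensitivity=0.25, epsilon=epsilon)
    # With the same seed, the low-sensitivity noise is a scaled-down copy.
    for h, l in zip(high, low):
        print(f"sensitivity 1.0: {h:+.4f}   sensitivity 0.25: {l:+.4f}")
```

Because the scale is linear in the sensitivity, cutting the sensitivity to a quarter cuts the expected noise magnitude to a quarter while epsilon stays fixed. This is the kind of gain the abstract attributes to its feature selection and mutual information steps.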
About the journal:
Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology marketing, and social computing.
We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.