Taisho Sasada, Masataka Kawai, Yuzo Taenaka, Doudou Fall, Y. Kadobayashi
{"title":"Differentially-Private Text Generation via Text Preprocessing to Reduce Utility Loss","authors":"Taisho Sasada, Masataka Kawai, Yuzo Taenaka, Doudou Fall, Y. Kadobayashi","doi":"10.1109/ICAIIC51459.2021.9415242","DOIUrl":null,"url":null,"abstract":"To provide user-generated texts to third parties, various anonymization used to process the texts. Since this anonymization assume the knowledge possessed by the adversary, sensitive information may be leaked depending on the adversary’s knowledge even after this anonymization. Moreover, setting the strongest assumptions about the adversary’s knowledge leads to the degradation of the utility as the data by removing any quasi-identifiers. Therefore, instead of providing original data, a method to generate differentially-private synthetic data has been proposed. Differential privacy is more flexible than anonymization technologies because it does not require the assumption of the adversary’s knowledge. However, if a large noise is added to the gradient in text generative model to satisfy differential privacy, the utility of the synthetic text is degraded. Since differential privacy can be satisfied with a small noise in data containing duplicates, it is possible to reduce utility loss as text by creating duplicates before adding noise. In this study, we reduce the amount of noise added by creating duplicates through generalization, thereby minimizing text utility loss. 
By constructing a differentially-private text generation model, we can provide synthetic text and promote text utilization while protecting privacy information in the text.","PeriodicalId":432977,"journal":{"name":"2021 International Conference on Artificial Intelligence in Information and Communication (ICAIIC)","volume":"63 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 International Conference on Artificial Intelligence in Information and Communication (ICAIIC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICAIIC51459.2021.9415242","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3
Abstract
To provide user-generated texts to third parties, various anonymization techniques are used to process the texts. Since these techniques rely on assumptions about the knowledge possessed by the adversary, sensitive information may still be leaked after anonymization if the adversary's actual knowledge differs from what was assumed. Moreover, making the strongest assumptions about the adversary's knowledge degrades the utility of the data, because every quasi-identifier must be removed. Therefore, instead of providing the original data, methods that generate differentially-private synthetic data have been proposed. Differential privacy is more flexible than anonymization technologies because it requires no assumptions about the adversary's knowledge. However, if large noise is added to the gradients of a text generation model to satisfy differential privacy, the utility of the synthetic text degrades. Since differential privacy can be satisfied with smaller noise when the data contains duplicates, utility loss in the text can be reduced by creating duplicates before adding noise. In this study, we reduce the amount of added noise by creating duplicates through generalization, thereby minimizing the utility loss of the text. By constructing a differentially-private text generation model, we can provide synthetic text and promote text utilization while protecting private information contained in the text.
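The abstract describes adding noise to the gradients of a generative model to satisfy differential privacy. A minimal sketch of that general pattern (per-example gradient clipping followed by Gaussian noise, as in DP-SGD) is shown below. This is an illustration of the standard technique only, not the paper's specific method; the function name `dp_noisy_gradient` and its parameters are hypothetical.

```python
import math
import random

def dp_noisy_gradient(grad, clip_norm=1.0, noise_multiplier=1.0):
    """Clip a gradient vector to L2 norm `clip_norm`, then add Gaussian
    noise scaled by `noise_multiplier` (the DP-SGD pattern).

    Illustrative sketch only; real implementations calibrate the noise
    to a target (epsilon, delta) privacy budget.
    """
    # 1) Clip: rescale the gradient so its L2 norm is at most clip_norm.
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    clipped = [g * scale for g in grad]

    # 2) Noise: add zero-mean Gaussian noise to each coordinate.
    # Smaller per-example influence (e.g. from duplicated records)
    # allows a smaller noise_multiplier for the same privacy level,
    # which is the intuition the paper builds on.
    sigma = noise_multiplier * clip_norm
    return [g + random.gauss(0.0, sigma) for g in clipped]

# Example: a gradient of norm 5 is clipped to norm 1, then noised.
random.seed(0)
noisy = dp_noisy_gradient([3.0, 4.0], clip_norm=1.0, noise_multiplier=0.5)
```

With `noise_multiplier=0.0` the function reduces to pure norm clipping, which makes the clipping step easy to check in isolation.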