Moving Conditional GAN Close to Data: Synthetic Tabular Data Generation and Its Experimental Evaluation

IF 7.5 | Zone 3 (Computer Science) | JCR Q1: COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Big Data | Pub Date: 2024-08-13 | DOI: 10.1109/TBDATA.2024.3442534

Abdul Majeed;Seong Oun Hwang

Journal Article | IEEE Transactions on Big Data, vol. 11 (3), pp. 1188-1205 | https://ieeexplore.ieee.org/document/10634770/

Citations: 0

Abstract

Recently, data has ousted oil as the most economical resource in the world, but most companies are reluctant to share customer/user data in pure form and on a large scale due to privacy concerns. Many innovative technologies (e.g., federated learning, split learning) are employed to meet the growing demand for privacy preservation. Despite these technologies, acquiring personal data in order to optimize utility, and then sharing it on a large scale, is still very challenging. Thanks to the rapid development of artificial intelligence (AI), a relatively new and promising solution to resolve these challenges is to generate synthetic data (SD) by mirroring the original dataset’s properties. SD is a promising solution to address growing privacy demands as well as the utility/analytics requirements of many industry stakeholders. In this paper, we propose and implement an SD generation method from a real dataset containing both numerical and categorical attributes by using an improved conditional generative adversarial network (CGAN), and we quantify the feasibility of SD on technical and theoretical grounds. We provide a detailed analysis of SD in original and anonymized forms with the help of multiple use cases, whereas prior research simply assumed that privacy issues in SD are small because AI models do not overfit or SD has a poor connection with real data. We provide insights into the characteristics of SD (distributions, value frequencies, correlations, etc.) produced by the CGAN in relation to the real data. To the best of our knowledge, this is the pioneering work that provides an experiment-based analysis of the quality, privacy, and utility of SD in relation to a real benchmark dataset.
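The abstract mentions comparing synthetic data with real data on value frequencies and correlations. The following is a minimal sketch of what such a comparison could look like, not the authors' evaluation procedure: the function names, the choice of total variation distance and Pearson correlation, and the toy columns are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def total_variation_distance(real_values, synthetic_values):
    """Compare value frequencies of one categorical column.

    Returns 0.0 when both columns have identical relative frequencies
    and 1.0 when they share no category at all.
    """
    real_freq = Counter(real_values)
    synth_freq = Counter(synthetic_values)
    categories = set(real_freq) | set(synth_freq)
    n_real, n_synth = len(real_values), len(synthetic_values)
    return 0.5 * sum(
        abs(real_freq[c] / n_real - synth_freq[c] / n_synth)
        for c in categories
    )


def pearson(x, y):
    """Pearson correlation of two equal-length numeric columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_gap(real_x, real_y, synth_x, synth_y):
    """Absolute difference between the real and synthetic correlation
    of the same pair of numeric columns (0.0 means it is preserved)."""
    return abs(pearson(real_x, real_y) - pearson(synth_x, synth_y))


# Hypothetical toy columns, invented for this example only.
real_job = ["clerk", "clerk", "nurse", "nurse"]
synth_job = ["clerk", "clerk", "clerk", "nurse"]
real_age, real_hours = [25, 32, 47, 51], [30, 35, 45, 50]
synth_age, synth_hours = [27, 30, 44, 55], [50, 45, 35, 30]

print(total_variation_distance(real_job, synth_job))  # 0.25
print(correlation_gap(real_age, real_hours, synth_age, synth_hours))
```

In this toy case the synthetic job column over-represents one category, and the synthetic age/hours pair reverses the direction of the real correlation, so both scores flag a mismatch. A real evaluation would apply such checks to every column and column pair.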


Source journal

IEEE Transactions on Big Data Multiple-

CiteScore: 11.80

Self-citation rate: 2.80%

Articles published: 114

About the journal: The IEEE Transactions on Big Data publishes peer-reviewed articles focusing on big data. These articles present innovative research ideas and application results across disciplines, including novel theories, algorithms, and applications. Research areas cover a wide range, such as big data analytics, visualization, curation, management, semantics, infrastructure, standards, performance analysis, intelligence extraction, scientific discovery, security, privacy, and legal issues specific to big data. The journal also prioritizes applications of big data in fields generating massive datasets.