Quality boost of tabular data synthesis using interpolative cumulative distribution function decoding and type-specific conditioner
Seungchan Roh, Seunghwan Song, Kwan-Yong Park, Byoung-mo Koo, Jun-Geol Baek
Neurocomputing, Volume 645, Article 130484. Publication date: 2025-05-16. DOI: 10.1016/j.neucom.2025.130484
https://www.sciencedirect.com/science/article/pii/S0925231225011567
Abstract
Tabular data synthesis is an important research area for both privacy protection and data utilization, and synthesis techniques have been explored extensively to broaden the use of tabular data. The primary goal is to generate high-quality data that preserve the insights of the original data while reducing the risk of data breaches. In this study, we propose a novel generative adversarial network (GAN) that improves the quality of synthesized tabular data. Two factors are considered critical to data quality: how continuous variables are transformed, and how conditioning is applied to capture dependencies between variables. Our method therefore uses interpolative cumulative distribution function (CDF) decoding for continuous columns together with a type-specific conditioner. Interpolative CDF decoding addresses a limitation of the inverse CDF method that restricts the diversity of synthetic data. The type-specific conditioner models the interdependencies between columns by integrating both discrete and continuous conditions. Introducing these conditional dependencies enables the generator to capture complex dependencies between columns more accurately, thereby enhancing the fidelity of the synthetic data. Together, interpolation in the decoding process and the condition generation method make the synthetic data more realistic. A comprehensive evaluation on six datasets demonstrated that the proposed method is effective in terms of the quality, usability, and privacy level of the synthesized data. The source code is available at https://github.com/rch1025/Tabular-GAN.
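The abstract does not spell out the decoding scheme, so the following is only a minimal sketch of the general idea, not the authors' implementation (which is in the linked repository). It assumes an empirical CDF fitted on a training column and plain linear interpolation between neighbouring observed values. The function names `fit_empirical_cdf`, `decode_step`, and `decode_interpolated` are hypothetical and introduced here for illustration.

```python
import numpy as np


def fit_empirical_cdf(column):
    """Return the sorted unique values of a column and their empirical CDF levels."""
    values, counts = np.unique(column, return_counts=True)
    cdf = np.cumsum(counts) / counts.sum()
    return values, cdf


def decode_step(u, values, cdf):
    """Plain inverse empirical CDF: map u to the smallest observed value
    whose CDF level is >= u. Only values seen in training can be produced."""
    idx = np.searchsorted(cdf, u, side="left")
    idx = np.clip(idx, 0, len(values) - 1)
    return values[idx]


def decode_interpolated(u, values, cdf):
    """Assumed interpolative variant: linearly interpolate between the
    observed values on either side of u, so unseen values can be produced."""
    return np.interp(u, cdf, values)


# Toy continuous column (illustrative data, not from the paper).
train = np.array([1.0, 2.0, 2.0, 4.0, 7.0])
values, cdf = fit_empirical_cdf(train)

rng = np.random.default_rng(0)
u = rng.uniform(size=1000)  # stand-in for generator outputs in CDF space

step_samples = decode_step(u, values, cdf)
interp_samples = decode_interpolated(u, values, cdf)

print("distinct values, step decoding:", len(np.unique(step_samples)))
print("distinct values, interpolated decoding:", len(np.unique(interp_samples)))
```

In this toy setting the step decoder can only return the four values present in the training column, while the interpolated decoder fills in the gaps between them. This is one way to read the diversity limitation the abstract attributes to the inverse CDF method.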
About the journal:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.