GANBLR: A Tabular Data Generation Model

2021 IEEE International Conference on Data Mining (ICDM) Pub Date : 2021-12-01 DOI:10.1109/ICDM51629.2021.00103

Yishuo Zhang, Nayyar Zaidi, Jiahui Zhou, Gang Li


Citations: 7

Abstract

Generative Adversarial Network (GAN) models have been shown to be effective in a wide range of machine learning applications, and tabular data generation is no exception. Notably, several state-of-the-art tabular data generation models, such as CTGAN, TableGan, and MedGAN, are based on GANs. Even though these models achieve superior performance in generating artificial data when trained on a range of datasets, there is considerable room (and desire) for improvement. Existing methods also have weaknesses beyond performance. First, current methods focus only on the performance of the model, and limited emphasis is given to its interpretation. Second, current models operate on raw features only, and hence fail to exploit prior knowledge of explicit feature interactions that could be utilized during the data generation process. To alleviate these two limitations, we propose a novel tabular data generation model, Generative Adversarial Network modelling inspired from Naive Bayes and Logistic Regression's relationship (GANBLR), which not only addresses the interpretation limitation of existing tabular GAN-based models but also provides the capability to handle explicit feature interactions. By extensively evaluating on a wide range of datasets, we demonstrate GANBLR's superior performance as well as better interpretability (explanation of feature importance in the synthetic generation process) compared to existing state-of-the-art tabular data generation models.
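
The abstract names the relationship between Naive Bayes and Logistic Regression without spelling it out. As a reminder of the textbook form of that relationship, the sketch below fits a Laplace-smoothed naive Bayes classifier on a toy categorical dataset and then shows that the same posterior is produced by a softmax (logistic regression) model over one-hot features, with weights set to the log conditional probabilities and biases set to the log priors. This is not the GANBLR architecture or the authors' code; the dataset, the smoothing constant, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter

# Toy categorical dataset (hypothetical): rows are (feature_0, feature_1), labels are 0/1.
X = [(0, 1), (1, 1), (0, 0), (1, 0), (1, 1), (0, 0)]
y = [0, 1, 0, 1, 1, 0]
N_VALUES = 2   # each feature takes values in {0, 1}
N_CLASSES = 2


def fit_naive_bayes(X, y, alpha=1.0):
    """Return Laplace-smoothed class priors and per-feature conditionals."""
    n_features = len(X[0])
    class_count = Counter(y)
    prior = [(class_count[c] + alpha) / (len(y) + alpha * N_CLASSES)
             for c in range(N_CLASSES)]
    # cond[c][i][v] = P(x_i = v | y = c)
    cond = [[[0.0] * N_VALUES for _ in range(n_features)]
            for _ in range(N_CLASSES)]
    for c in range(N_CLASSES):
        rows = [x for x, label in zip(X, y) if label == c]
        for i in range(n_features):
            for v in range(N_VALUES):
                hits = sum(1 for x in rows if x[i] == v)
                cond[c][i][v] = (hits + alpha) / (len(rows) + alpha * N_VALUES)
    return prior, cond


def nb_posterior(x, prior, cond):
    """P(y | x) computed directly from the naive Bayes factorisation."""
    joint = [prior[c] * math.prod(cond[c][i][v] for i, v in enumerate(x))
             for c in range(N_CLASSES)]
    total = sum(joint)
    return [j / total for j in joint]


def one_hot(x):
    """Flatten x into a one-hot vector of length n_features * N_VALUES."""
    vec = [0.0] * (len(x) * N_VALUES)
    for i, v in enumerate(x):
        vec[i * N_VALUES + v] = 1.0
    return vec


def lr_posterior(x, bias, weights):
    """Softmax of a linear function of the one-hot features."""
    phi = one_hot(x)
    logits = [bias[c] + sum(w * f for w, f in zip(weights[c], phi))
              for c in range(N_CLASSES)]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


prior, cond = fit_naive_bayes(X, y)

# Logistic-regression parameters read off the naive Bayes estimates:
# bias_c = log P(y = c), weight_{c,(i,v)} = log P(x_i = v | y = c).
bias = [math.log(p) for p in prior]
weights = [[math.log(cond[c][i][v])
            for i in range(len(X[0])) for v in range(N_VALUES)]
           for c in range(N_CLASSES)]

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    p_nb = nb_posterior(x, prior, cond)
    p_lr = lr_posterior(x, bias, weights)
    print(x, [round(p, 4) for p in p_nb], [round(p, 4) for p in p_lr])
```

The two columns printed for each input agree because the log of the naive Bayes joint probability is linear in the one-hot features. A discriminatively trained logistic regression would start from the same functional form but fit the weights freely rather than reading them off the counts.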

