Zero-Inflated Text Data Analysis using Generative Adversarial Networks and Statistical Modeling

Impact factor 2.6, JCR Q2 (Computer Science, Interdisciplinary Applications)

Computers. Pub Date: 2023-12-10. DOI: 10.3390/computers12120258

Sunghae Jun



Abstract

Zero inflation arises in many big data analyses, and it strongly affects the analysis of large text collections. Preprocessed text data typically form a document-term matrix, with documents as rows and terms as columns, in which each element is the frequency of a term in a document. Most elements of this matrix are zeros because the number of columns (terms) is much larger than the number of rows (documents). This sparsity degrades model performance in text data analysis. To overcome the problem, we propose a method of zero-inflated text data analysis using generative adversarial networks (GAN) and statistical modeling. We address zero inflation by using synthetic data generated from the original zero-inflated data. The main contribution of our study is a way to replace zero values with very small numeric values carrying random noise, produced by the GAN. The generator and discriminator of the GAN learn the zero-inflated text data together and build a model that generates synthetic data that can replace the zero-inflated data. We conducted experiments on real and simulated data sets to verify the improved performance of the proposed method. We used five quantitative measures, namely prediction sum of squares, R-squared, log-likelihood, Akaike information criterion, and Bayesian information criterion, to compare model performance on the original and synthetic data sets. On all five measures, the proposed method performed better than the traditional methods.
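To make two ideas from the abstract concrete, here is a minimal sketch. It is not the paper's implementation and does not cover the GAN step. It builds a small document-term matrix and reports its share of zeros, then computes the five named measures for a fitted model. The abstract does not say which statistical model was used, so a simple linear regression with Gaussian errors is assumed here. The function names, the whitespace tokenizer, the toy data, and the parameter count k = 3 are all illustrative assumptions.

```python
import math
from collections import Counter


def document_term_matrix(docs):
    """Return (vocabulary, matrix): one row per document, one column per term.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


def zero_fraction(matrix):
    """Share of matrix cells equal to zero (the degree of zero inflation)."""
    cells = [value for row in matrix for value in row]
    return sum(1 for value in cells if value == 0) / len(cells)


def fit_measures(x, y):
    """PRESS, R-squared, log-likelihood, AIC and BIC for y = a + b*x.

    Assumptions: least-squares fit, Gaussian errors, and k = 3 estimated
    parameters (intercept, slope, error variance). Requires a non-perfect
    fit, because the log-likelihood takes log(RSS / n).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    rss = sum(e ** 2 for e in residuals)
    tss = sum((yi - y_bar) ** 2 for yi in y)

    # Leverage of each point; PRESS sums the squared leave-one-out residuals.
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    press = sum((e / (1 - h)) ** 2 for e, h in zip(residuals, leverage))

    # Gaussian log-likelihood at the maximum-likelihood variance RSS / n.
    loglik = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    k = 3
    return {
        "press": press,
        "r2": 1 - rss / tss,
        "loglik": loglik,
        "aic": 2 * k - 2 * loglik,
        "bic": k * math.log(n) - 2 * loglik,
    }


if __name__ == "__main__":
    docs = ["gan model text", "text data zero", "zero inflated data model"]
    vocab, dtm = document_term_matrix(docs)
    print(vocab)
    print(dtm)
    print("zero fraction:", zero_fraction(dtm))
    print(fit_measures([0, 1, 2, 3, 4], [1.0, 2.9, 5.2, 6.8, 9.1]))
```

Lower PRESS, AIC, and BIC and higher R-squared and log-likelihood indicate a better fit. Under these assumptions, a comparison like the one the abstract describes would fit the same model to the original and the synthetic data and compare the resulting values.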
