Zero-Inflated Text Data Analysis using Generative Adversarial Networks and Statistical Modeling

IF 2.6 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS
Sunghae Jun
{"title":"Zero-Inflated Text Data Analysis using Generative Adversarial Networks and Statistical Modeling","authors":"Sunghae Jun","doi":"10.3390/computers12120258","DOIUrl":null,"url":null,"abstract":"In big data analysis, various zero-inflated problems are occurring. In particular, the problem of inflated zeros has a great influence on text big data analysis. In general, the preprocessed data from text documents are a matrix consisting of the documents and terms for row and column, respectively. Each element of this matrix is an occurred frequency of term in a document. Most elements of the matrix are zeros, because the number of columns is much larger than the rows. This problem is a cause of decreasing model performance in text data analysis. To overcome this problem, we propose a method of zero-inflated text data analysis using generative adversarial networks (GAN) and statistical modeling. In this paper, we solve the zero-inflated problem using synthetic data generated from the original data with zero inflation. The main finding of our study is how to change zero values to the very small numeric values with random noise through the GAN. The generator and discriminator of the GAN learned the zero-inflated text data together and built a model that generates synthetic data that can replace the zero-inflated data. We conducted experiments and showed the results, using real and simulation data sets to verify the improved performance of our proposed method. In our experiments, we used five quantitative measures, prediction sum of squares, R-squared, log-likelihood, Akaike information criterion and Bayesian information criterion to evaluate the model’s performance between original and synthetic data sets. We found that all performances of our proposed method are better than the traditional methods.","PeriodicalId":46292,"journal":{"name":"Computers","volume":"848 ","pages":""},"PeriodicalIF":2.6000,"publicationDate":"2023-12-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computers","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3390/computers12120258","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
引用次数: 0

Abstract

In big data analysis, various zero-inflated problems are occurring. In particular, the problem of inflated zeros has a great influence on text big data analysis. In general, the preprocessed data from text documents are a matrix consisting of the documents and terms for row and column, respectively. Each element of this matrix is an occurred frequency of term in a document. Most elements of the matrix are zeros, because the number of columns is much larger than the rows. This problem is a cause of decreasing model performance in text data analysis. To overcome this problem, we propose a method of zero-inflated text data analysis using generative adversarial networks (GAN) and statistical modeling. In this paper, we solve the zero-inflated problem using synthetic data generated from the original data with zero inflation. The main finding of our study is how to change zero values to the very small numeric values with random noise through the GAN. The generator and discriminator of the GAN learned the zero-inflated text data together and built a model that generates synthetic data that can replace the zero-inflated data. We conducted experiments and showed the results, using real and simulation data sets to verify the improved performance of our proposed method. In our experiments, we used five quantitative measures, prediction sum of squares, R-squared, log-likelihood, Akaike information criterion and Bayesian information criterion to evaluate the model’s performance between original and synthetic data sets. We found that all performances of our proposed method are better than the traditional methods.
利用生成式对抗网络和统计建模进行零膨胀文本数据分析
在大数据分析中,各种膨胀零问题层出不穷。其中,零膨胀问题对文本大数据分析影响很大。一般来说,文本文档的预处理数据是一个矩阵,分别由行和列的文档和术语组成。该矩阵的每个元素都是术语在文档中的出现频率。由于列数远大于行数,矩阵中的大部分元素都是零。这个问题是导致文本数据分析中模型性能下降的原因之一。为了克服这个问题,我们提出了一种利用生成式对抗网络(GAN)和统计建模进行零膨胀文本数据分析的方法。在本文中,我们使用从原始数据生成的零膨胀合成数据来解决零膨胀问题。我们研究的主要发现是如何通过生成式对抗网络将零值变为带有随机噪声的极小数值。GAN 的生成器和判别器共同学习了零膨胀文本数据,并建立了一个模型,生成可以替代零膨胀数据的合成数据。我们使用真实数据集和模拟数据集进行了实验并展示了结果,以验证我们所提方法的改进性能。在实验中,我们使用了预测平方和、R 方、对数似然、阿凯克信息准则和贝叶斯信息准则这五个定量指标来评估模型在原始数据集和合成数据集之间的性能。我们发现,我们提出的方法的所有性能都优于传统方法。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
Computers
Computers COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS-
CiteScore
5.40
自引率
3.60%
发文量
153
审稿时长
11 weeks
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信