GLSTM: A novel approach for prediction of real & synthetic PID diabetes data using GANs and LSTM classification model

Sushma Jaiswal, Priyanka Gupta
Journal: International Journal of Experimental Research and Review
Published: 2023-04-30
DOI: 10.52756/ijerr.2023.v30.004 (https://doi.org/10.52756/ijerr.2023.v30.004)

Abstract

Generative adversarial networks (GANs) are a major advance in modern artificial intelligence. Deep learning-based GANs can generate realistic synthetic tabular data, which can be used to enlarge a relatively small training dataset while preserving the confidentiality of the original data. In this context, we implemented a GAN framework that generates diabetes data to support health care professionals in clinical applications. The GAN framework is validated on the Pima Indian Diabetes (PID) dataset. Preprocessing steps, including handling missing values, outliers and class imbalance, improve data quality. Exploratory data analysis with heat maps, bar graphs and histograms is used for data visualisation. We employed hypothesis testing to examine the resemblance between real data and GAN-generated synthetic data. We propose a GAN-Long Short-Term Memory (GLSTM) system, in which a GAN is used for data augmentation and an LSTM is used for diabetes classification. Several generative models, namely CTGAN, Vanilla GAN, Copula GAN, Gaussian Copula GAN, and TVAE GAN, are used to generate the synthetic dataset. Experiments were conducted on real data, on synthetic data, and on a combination of the two. The model trained on both real and synthetic data reached an accuracy of 97%, compared to 92% when only real data was used. We also observed that synthetic data could be used in place of real data, as the mean correlation between synthetic and real data is 0.93. Our results outperform those of state-of-the-art methods.
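
The abstract does not say which hypothesis test was used to compare real and synthetic data. As one common choice for a numeric column, the sketch below applies a two-sample Kolmogorov-Smirnov (KS) test, using the standard large-sample critical value. The function names `ks_statistic` and `ks_rejects` are hypothetical, and the sample values are simulated stand-ins, not PID data or GAN output.

```python
import bisect
import math
import random


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of samples a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    # Both empirical CDFs are step functions, so the largest gap
    # occurs at one of the observed values.
    for x in a_sorted + b_sorted:
        cdf_a = bisect.bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect.bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def ks_rejects(a, b, alpha=0.05):
    """True if the large-sample KS test rejects 'same distribution' at level alpha."""
    n, m = len(a), len(b)
    # Asymptotic critical value: c(alpha) * sqrt((n + m) / (n * m)).
    critical = math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))
    return ks_statistic(a, b) > critical


# Hypothetical stand-ins for one numeric column (values are simulated,
# not taken from the PID dataset or from any GAN output).
rng = random.Random(0)
real = [rng.gauss(120, 30) for _ in range(500)]
close_synthetic = [rng.gauss(120, 30) for _ in range(500)]
shifted_synthetic = [rng.gauss(160, 30) for _ in range(500)]

print("close:  ", round(ks_statistic(real, close_synthetic), 3))
print("shifted:", round(ks_statistic(real, shifted_synthetic), 3),
      "rejected:", ks_rejects(real, shifted_synthetic))
```

A small statistic means the synthetic column's distribution is hard to distinguish from the real one. In a full comparison the test would be run once per feature, and a correction for multiple comparisons may be appropriate.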