A Multi-Split Cross-Strategy for Enhancing Machine Learning Algorithms Prediction Results with Data Generated by Conditional Generative Adversarial Network

Journal of Computer Science Pub Date : 2024-07-01 DOI:10.3844/jcssp.2024.700.707

Abdelfattah Abassi, Brahim Bakkas, Moustapha El Jai, Ahmed Arid, Hussain Benazza

Abstract

In this study, we present a Multi-Split Cross-Strategy (MSC-Strategy) designed to leverage synthetic tabular data generated by a Conditional Generative Adversarial Network (CGAN). Our study investigates the potential of synthetic data, in comparison to real-world data, for improving the predictive results of machine learning models. First, we develop a CGAN architecture tailored to generating synthetic tabular data and train it on a comprehensive real-world dataset. Second, we validate the synthetic data generated by the CGAN to ensure its statistical fidelity and its resemblance to the distribution of the real data. Finally, we select a subset of the generated data and apply our strategy to create a new combined training set comprising the training set of real data and the chosen subset of generated data. To validate our approach, we employ six regression models, including Decision Tree (DT), K-Nearest Neighbors (KNN), Random Forest (RF), XGB Regressor (XGB), and Support Vector Regressor (SVR). Each model is trained and tested using four training sets: real data, generated data, combined data (the training set of real data plus the generated data), and the data formed by our MSC-Strategy. Our findings indicate that the training set formed by the MSC-Strategy yields stronger predictive performance than real-world data or generated data, highlighting its ability to enhance the predictions of machine learning models using only a subset of the generated data.
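The comparison described in the abstract (the same model trained on real, generated, combined, and selectively augmented training sets) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: the toy dataset, the jittered copies of real rows that stand in for CGAN output, the hand-written k-nearest-neighbors regressor, and in particular the hypothetical `pick_best_split` rule, which simply splits the generated rows into chunks and keeps the chunk that gives the lowest validation error when added to the real training set. The abstract does not specify how the MSC-Strategy selects its subset.

```python
import math
import random


def knn_predict(train, x, k=3):
    """Predict the target for x as the mean target of its k nearest training rows."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def rmse(train, test, k=3):
    """Root-mean-square error of a k-NN regressor fit on `train`, scored on `test`."""
    squared = [(knn_predict(train, x, k) - y) ** 2 for x, y in test]
    return math.sqrt(sum(squared) / len(squared))


def make_real(n, rng):
    """Toy 'real' dataset (assumption): two features, noisy linear target."""
    rows = []
    for _ in range(n):
        x = (rng.uniform(0, 1), rng.uniform(0, 1))
        rows.append((x, 3 * x[0] + 2 * x[1] + rng.gauss(0, 0.05)))
    return rows


def make_synthetic_stand_in(real_rows, n, rng, noise=0.05):
    """Stand-in for generated data: jittered copies of real rows. This is NOT a CGAN."""
    rows = []
    for _ in range(n):
        x, y = rng.choice(real_rows)
        jittered = tuple(v + rng.gauss(0, noise) for v in x)
        rows.append((jittered, y + rng.gauss(0, noise)))
    return rows


def pick_best_split(real_train, synthetic, validation, n_splits=4):
    """Hypothetical selection rule (assumption, not the paper's MSC-Strategy):
    split the synthetic rows into equal chunks and keep the chunk that, added to
    the real training set, gives the lowest RMSE on a held-out validation set."""
    size = len(synthetic) // n_splits
    chunks = [synthetic[i * size:(i + 1) * size] for i in range(n_splits)]
    return min(chunks, key=lambda chunk: rmse(real_train + chunk, validation))


rng = random.Random(0)
real = make_real(120, rng)
# The test set is kept apart from the validation set used for subset selection.
real_train, validation, test = real[:60], real[60:90], real[90:]
synthetic = make_synthetic_stand_in(real_train, 80, rng)
chosen = pick_best_split(real_train, synthetic, validation)

training_sets = {
    "real": real_train,
    "generated": synthetic,
    "combined": real_train + synthetic,
    "real + selected subset": real_train + chosen,
}
scores = {name: rmse(rows, test) for name, rows in training_sets.items()}
for name, score in scores.items():
    print(f"{name:>24}: RMSE = {score:.4f}")
```

All four training sets are scored on the same held-out real test set, so the printed errors are directly comparable. The numbers from this toy setup say nothing about the paper's results; the sketch only shows the shape of the experiment.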

Source journal

Journal of Computer Science (Computer Science: Computer Networks and Communications)

CiteScore

1.70

Self-citation rate

0.00%

About the journal: Journal of Computer Science aims to publish research articles on the theoretical foundations of information and computation and on practical techniques for their implementation and application in computer systems. JCS is updated twelve times a year and is a peer-reviewed journal covering the latest and most compelling research of the time.