A Multi-Split Cross-Strategy for Enhancing Machine Learning Algorithms Prediction Results with Data Generated by Conditional Generative Adversarial Network
Abdelfattah Abassi, Brahim Bakkas, Moustapha El Jai, Ahmed Arid, Hussain Benazza
Journal of Computer Science. Published 2024-07-01. DOI: 10.3844/jcssp.2024.700.707 (https://doi.org/10.3844/jcssp.2024.700.707)
Citations: 0
Abstract
In this study, we present a Multi-Split Cross-Strategy (MSC-Strategy) designed to leverage synthetic tabular data generated by a Conditional Generative Adversarial Network (CGAN). We investigate the potential of synthetic data, in comparison with real-world data, for improving machine learning predictive results. First, we develop a CGAN architecture tailored to generating synthetic tabular data and train it on a comprehensive real-world dataset. Second, we validate the generated data to ensure its statistical fidelity to the distribution of the real data. Finally, we select a subset of the generated data and apply our strategy to build a new training set that combines the real training set with the chosen subset. To validate our approach, we employ six diverse regression models, including Decision Tree (DT), K-Nearest Neighbors (KNN), Random Forest (RF), XGB Regressor (XGB), and Support Vector Regressor (SVR). Each model is trained and tested using four training sets: real data, generated data, combined data (the real training set plus the generated data), and the set formed by our MSC-Strategy. Our findings indicate that models trained on the MSC-Strategy set achieve better predictive performance than models trained on real data or on generated data, which shows that a subset of generated data is enough to enhance the predictions of machine learning models.
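The four-way comparison described in the abstract can be sketched in a few lines. This is a minimal, dependency-free illustration and not the authors' implementation: the abstract does not specify how the MSC-Strategy selects its subset, so `select_chunks` below is a hypothetical rule (keep a chunk of generated rows only if adding it lowers the validation error). `placeholder_generator` is likewise an assumption that stands in for a trained CGAN by jittering real rows, the dataset is a toy one, and a small hand-written k-NN regressor replaces the regression models named above.

```python
import math
import random

def knn_predict(train, x, k=5):
    # Mean target of the k training rows closest to x (Euclidean distance).
    nearest = sorted(train, key=lambda row: math.dist(row[0], x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def rmse(train, test, k=5):
    # Root-mean-square error of the k-NN regressor fitted on `train`.
    sq = [(knn_predict(train, x, k) - y) ** 2 for x, y in test]
    return math.sqrt(sum(sq) / len(sq))

def make_real(n, rng):
    # Toy "real" table (assumption): two features and one noisy target.
    rows = []
    for _ in range(n):
        x = (rng.uniform(-2, 2), rng.uniform(-2, 2))
        rows.append((x, math.sin(x[0]) + 0.5 * x[1] + rng.gauss(0, 0.1)))
    return rows

def placeholder_generator(real, n, rng):
    # NOT a CGAN: jitters real rows so the sketch runs without a trained model.
    out = []
    for _ in range(n):
        x, y = rng.choice(real)
        out.append((tuple(v + rng.gauss(0, 0.3) for v in x), y + rng.gauss(0, 0.3)))
    return out

def select_chunks(real_train, generated, val, n_splits=5):
    # Hypothetical selection rule (not taken from the paper):
    # keep a chunk only if adding it to the real training set lowers validation RMSE.
    baseline = rmse(real_train, val)
    size = len(generated) // n_splits
    kept = []
    for i in range(n_splits):
        chunk = generated[i * size:(i + 1) * size]
        if rmse(real_train + chunk, val) < baseline:
            kept.extend(chunk)
    return kept

rng = random.Random(0)
real = make_real(200, rng)
real_train, val, test = real[:120], real[120:160], real[160:]
generated = placeholder_generator(real_train, 200, rng)
selected = select_chunks(real_train, generated, val)

# The four training sets compared in the abstract, all scored on held-out real data.
training_sets = {
    "real": real_train,
    "generated": generated,
    "combined": real_train + generated,
    "real + selected subset": real_train + selected,
}
results = {name: rmse(rows, test) for name, rows in training_sets.items()}
for name, score in results.items():
    print(f"{name:>24}: RMSE = {score:.3f}")
```

Scoring every training set on the same held-out real test rows keeps the comparison fair. The validation split used for selection is kept separate from the test split so that the selection step does not leak information into the final scores.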
About the journal:
Journal of Computer Science publishes research articles on the theoretical foundations of information and computation and on practical techniques for their implementation and application in computer systems. The journal is peer reviewed, is updated twelve times a year, and covers the latest research in the field.