Using Gaussian Copulas and Generative Adversarial Networks for Generating Synthetic Data in Beet Productivity Analysis
Denize Palmito dos Santos, Julio Cezar Souza Vasconcelos
Sugar Tech, vol. 27, no. 2, pp. 407-417. Published 2024-11-19. DOI: 10.1007/s12355-024-01506-w
https://link.springer.com/article/10.1007/s12355-024-01506-w
Citations: 0
Abstract
In scientific research, field experiments are essential for validating theories under real conditions. However, these investigations are often limited by small samples, which can compromise the robustness and interpretability of results. Synthetic data generation offers a way to expand datasets, enabling more comprehensive analyses even when real data are limited. Although synthetic data are not real, they can preserve the mathematical and statistical properties of the real data, making them a valuable tool for improving analytical accuracy. This study generates synthetic data using two synthesizers: Gaussian Copulas and Generative Adversarial Neural Networks (GANs). The dataset comes from an evaluation of the effects of different nitrogen (N) fertilizer levels on the dry matter production of sugar beet roots. Five nitrogen fertilizer levels were tested: 0, 35, 70, 105, and 140 kg/ha, in a randomized block design with three blocks and five plots per block. The aim is to increase the sample size so that different statistical and machine learning models can be considered. The comparison between synthetic and real data showed that the Gaussian Copulas synthesizer outperformed the CTGAN synthesizer, as evidenced by how closely its graphical representations and model performance matched those of the real data. Furthermore, the random forest model trained on synthetic data generated by Gaussian Copulas achieved better performance metrics than the one trained on CTGAN-generated data, indicating that synthetic data can be a valuable support in the analysis of agronomic experiments.
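To make the Gaussian copula idea in the abstract concrete, the following is a minimal sketch in plain Python and not the authors' implementation. It models two columns (nitrogen level and root dry matter) by converting each to normal scores, estimating their correlation, sampling correlated normals, and mapping the samples back through the empirical quantiles of each column. The function names, the two-column simplification, and every number in the toy dataset are assumptions made for illustration; the dry matter values are invented and are not the paper's data.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5


def normal_scores(values):
    """Map each value to a standard-normal score via its mid-rank (ties share a rank)."""
    n = len(values)
    scores = []
    for v in values:
        below = sum(1 for w in values if w < v)
        equal = sum(1 for w in values if w == v)
        mid_rank = below + (equal + 1) / 2
        scores.append(STD_NORMAL.inv_cdf(mid_rank / (n + 1)))
    return scores


def empirical_quantile(sorted_values, u):
    """Inverse empirical CDF: the observed value at probability u."""
    idx = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]


def sample_gaussian_copula(x, y, n_samples, seed=0):
    """Draw synthetic (x, y) pairs that keep the rank dependence of the inputs."""
    rng = random.Random(seed)
    rho = pearson(normal_scores(x), normal_scores(y))
    noise_scale = max(0.0, 1.0 - rho * rho) ** 0.5  # guard against rounding above 1
    xs, ys = sorted(x), sorted(y)
    synthetic = []
    for _ in range(n_samples):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + noise_scale * rng.gauss(0.0, 1.0)  # correlated with z1
        synthetic.append((empirical_quantile(xs, STD_NORMAL.cdf(z1)),
                          empirical_quantile(ys, STD_NORMAL.cdf(z2))))
    return synthetic


# Toy layout: five N levels (kg/ha) in each of three blocks, as in the abstract.
n_levels = [0, 35, 70, 105, 140] * 3
# INVENTED dry matter values in arbitrary units, for illustration only.
dry_matter = [4.1, 5.3, 6.0, 6.8, 7.1,
              3.8, 5.0, 6.4, 6.5, 7.4,
              4.4, 5.6, 5.9, 7.0, 6.9]

synthetic = sample_gaussian_copula(n_levels, dry_matter, n_samples=100)
print(synthetic[:5])
```

Because this sketch maps samples back through empirical quantiles, it can only reproduce values already observed in each column. A fuller synthesizer would typically fit smooth marginal distributions and handle the block factor as well, which is outside the scope of this example.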
About the journal:
Sugar Tech aims to provide high-profile, up-to-date research publications, comments, and reviews on the most innovative, original, and rigorous developments in agricultural technologies for the improvement and production of sugar crops (sugarcane, sugar beet, sweet sorghum, Stevia, palm sugar, etc.), sugar processing, bioethanol production, bioenergy, value addition, and by-products. Interdisciplinary studies of fundamental problems in these subjects are also given high priority. In addition to full-length and short papers on original research, the journal publishes regular feature articles, reviews, comments, scientific correspondence, etc.