Synthetic Data Generation Using Genetic Algorithm

2023 2nd International Conference for Innovation in Technology (INOCON). Pub Date: 2023-03-03. DOI: 10.1109/INOCON57975.2023.10101072

Pratyusha Thogarchety, K. Das


Citations: 1

Abstract

Statistical machine learning models perform poorly on class-imbalanced data. Real-world datasets contain mostly ‘normal’ examples and very few ‘abnormal’ examples, yet in most cases the primary goal is to identify the abnormal instances. For example, if we want to develop a statistical machine learning model to identify financial fraud from historical transaction data, we can expect the majority of the data to come from the normal (non-fraudulent) class, with very few fraudulent transactions. Training on such an imbalanced dataset makes machine learning models highly biased towards the majority non-fraudulent class. As a result, the objective of catching fraudulent transactions fails, and misclassifying minority-class instances often incurs a much higher cost. Hence, a balanced dataset is required to train a sound model. Techniques such as undersampling, oversampling, and SMOTE have been proposed previously. In this paper, we propose a novel technique for generating synthetic data using a genetic search algorithm. We examine the effectiveness of the proposed algorithm on different datasets and report the results in Section V.
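The bias described in the abstract can be illustrated with a minimal sketch. It is not the authors' genetic search method, which the abstract does not detail. The toy data (990 normal and 10 fraudulent labels) and the function names `majority_baseline_metrics` and `random_oversample` are assumptions made for illustration. The sketch shows that a model which always predicts the majority class scores high accuracy while catching no fraud, and then applies plain random oversampling, one of the earlier techniques the abstract mentions.

```python
import random
from collections import Counter


def majority_baseline_metrics(labels, positive=1):
    """Accuracy and positive-class recall of a model that always
    predicts the most frequent label."""
    majority = Counter(labels).most_common(1)[0][0]
    predictions = [majority] * len(labels)
    correct = sum(p == y for p, y in zip(predictions, labels))
    positives = sum(y == positive for y in labels)
    true_positives = sum(
        p == positive and y == positive for p, y in zip(predictions, labels)
    )
    accuracy = correct / len(labels)
    recall = true_positives / positives if positives else 0.0
    return accuracy, recall


def random_oversample(rows, labels, minority=1, seed=0):
    """Duplicate randomly chosen minority rows until both classes have
    equal counts. Assumes binary labels and at least one minority row."""
    rng = random.Random(seed)
    minority_rows = [r for r, y in zip(rows, labels) if y == minority]
    majority_count = sum(y != minority for y in labels)
    extra = max(majority_count - len(minority_rows), 0)
    new_rows = list(rows)
    new_labels = list(labels)
    for _ in range(extra):
        new_rows.append(rng.choice(minority_rows))
        new_labels.append(minority)
    return new_rows, new_labels


# Hypothetical toy data: 990 normal (label 0) and 10 fraudulent (label 1) rows.
labels = [0] * 990 + [1] * 10
rows = [[float(i)] for i in range(len(labels))]

accuracy, recall = majority_baseline_metrics(labels)
balanced_rows, balanced_labels = random_oversample(rows, labels)
class_counts = Counter(balanced_labels)

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
print(f"class counts after oversampling: {dict(class_counts)}")
```

Random oversampling only repeats existing minority rows, so it adds no new information. Synthetic generation approaches such as SMOTE or the genetic search proposed in the paper aim to create new minority examples instead.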

