Synthetic Data Generation Using Genetic Algorithm
Pratyusha Thogarchety, K. Das
2023 2nd International Conference for Innovation in Technology (INOCON), published 2023-03-03
DOI: 10.1109/INOCON57975.2023.10101072 (https://doi.org/10.1109/INOCON57975.2023.10101072)
Citation count: 1
Abstract
Statistical machine learning models perform poorly on class-imbalanced data. Real-world datasets contain mostly ‘normal’ examples and very few ‘abnormal’ ones, and in most cases the primary goal is to identify the abnormal instances. For example, if we want to develop a statistical machine learning model to identify financial fraud from historical transaction data, we can expect the majority of the data to come from the normal (non-fraudulent) class, with very few examples of fraudulent transactions. Training on such an imbalanced dataset makes machine learning models highly biased towards the majority non-fraudulent class. As a result, the objective of catching fraudulent transactions fails, and misclassifying such minority-class instances often incurs a much higher cost. Hence, a balanced dataset is required to train a sound model. Techniques such as undersampling, oversampling, and SMOTE have been proposed previously. In this paper, we propose a novel technique for generating synthetic data using a genetic search algorithm. We examine the effectiveness of the proposed algorithm on different datasets and report the results in Section V.
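The abstract does not describe the proposed algorithm, so the following is only a generic illustration of how a genetic search could produce synthetic minority-class samples. It should not be read as the authors' method. Everything in it is an assumption made for the sketch: the fitness function (distance to the nearest majority point minus distance to the nearest minority point), uniform crossover, Gaussian mutation, truncation selection, and all function and parameter names.

```python
import math
import random


def nearest_distance(point, others):
    """Euclidean distance from `point` to its closest neighbour in `others`."""
    return min(math.dist(point, other) for other in others)


def fitness(candidate, minority, majority):
    """Hypothetical fitness: high when the candidate lies near real minority
    samples and far from majority samples."""
    return nearest_distance(candidate, majority) - nearest_distance(candidate, minority)


def crossover(parent_a, parent_b, rng):
    """Uniform crossover: each feature is copied from one parent at random."""
    return tuple(a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b))


def mutate(individual, rng, sigma=0.1):
    """Gaussian mutation: add small zero-mean noise to every feature."""
    return tuple(x + rng.gauss(0.0, sigma) for x in individual)


def generate_synthetic(minority, majority, n_new, generations=20, seed=0):
    """Evolve a population of candidate points and return the `n_new` fittest."""
    rng = random.Random(seed)
    pop_size = max(2 * n_new, 4)

    def score(individual):
        return fitness(individual, minority, majority)

    # Initial population: jittered copies of real minority samples.
    population = [mutate(rng.choice(minority), rng) for _ in range(pop_size)]

    for _ in range(generations):
        # Truncation selection: the fitter half survives unchanged (elitism).
        population.sort(key=score, reverse=True)
        parents = population[: pop_size // 2]
        # Refill the population with mutated offspring of random parent pairs.
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children

    population.sort(key=score, reverse=True)
    return population[:n_new]


if __name__ == "__main__":
    # Toy 2-D data: ten "normal" points near the origin, three "abnormal" ones.
    majority = [(0.1 * i, 0.05 * i) for i in range(10)]
    minority = [(5.0, 5.0), (5.5, 4.8), (4.7, 5.2)]
    for sample in generate_synthetic(minority, majority, n_new=7):
        print(sample)
```

Appending the seven returned points to the three real minority points would balance this toy dataset at ten examples per class. Note that this simple fitness rewards candidates for sitting very close to existing minority points, so a practical version would likely need an additional diversity term.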