Binary text classification using genetic programming with crossover-based oversampling for imbalanced datasets

Turkish J. Electr. Eng. Comput. Sci. · Pub Date: 2023-01-01 · DOI: 10.55730/1300-0632.3978

Mona Khalifa A. Aljero, Nazife Dimililer

Journal: Turkish J. Electr. Eng. Comput. Sci. · Volume: 250 1 · Pages: 180-192 · Publication type: Journal Article · URL: https://doi.org/10.55730/1300-0632.3978

Citations: 0

Abstract


Binary text classification using genetic programming with crossover-based oversampling for imbalanced datasets

It is well known that classifiers trained on imbalanced datasets are usually biased toward the majority class. Such models can show high classification performance overall and for the majority class even when performance on the minority class is significantly lower. This paper presents a genetic programming (GP) model with a crossover-based oversampling technique for imbalanced datasets in binary text classification. The aim of the study is to address the class imbalance through oversampling and thereby improve the performance of the GP model. The proposed technique employs a crossover operator to generate new samples for the minority class in an imbalanced text dataset. Combining this crossover-based oversampling technique with GP improved performance: the proposed combination outperforms all GP applications that use the original dataset without resampling. Moreover, the proposed system surpassed GP approaches that use the synthetic minority oversampling technique (SMOTE) and random oversampling. A further comparison with the state of the art on five imbalanced text datasets, in terms of F1-score, shows the superior performance of the proposed approach.
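The abstract does not say how documents are represented or which crossover variant is used, so the following is only a rough sketch of the general idea rather than the authors' method. It assumes each minority-class document is a fixed-length numeric feature vector (for example, term weights) and applies single-point crossover to two randomly chosen minority samples until the class sizes match. The names `single_point_crossover` and `crossover_oversample` are hypothetical.

```python
import random
from typing import List, Sequence

Vector = List[float]


def single_point_crossover(a: Sequence[float], b: Sequence[float],
                           rng: random.Random) -> Vector:
    """Join the head of parent `a` to the tail of parent `b` at a random cut."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("parents must have equal length of at least 2")
    # Cut point in 1..len-1, so each parent contributes at least one feature.
    cut = rng.randint(1, len(a) - 1)
    return list(a[:cut]) + list(b[cut:])


def crossover_oversample(minority: List[Vector], n_majority: int,
                         seed: int = 0) -> List[Vector]:
    """Create synthetic minority vectors until the minority size equals n_majority.

    Assumption: balancing to a 1:1 ratio; the paper may use a different target.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to cross over")
    rng = random.Random(seed)
    synthetic: List[Vector] = []
    while len(minority) + len(synthetic) < n_majority:
        # Pick two distinct original minority samples as parents.
        parent1, parent2 = rng.sample(minority, 2)
        synthetic.append(single_point_crossover(parent1, parent2, rng))
    return synthetic


# Toy data: 3 minority vectors, 8 majority samples, so 5 synthetic ones are needed.
minority = [
    [0.0, 0.5, 0.0, 0.2],
    [0.3, 0.0, 0.1, 0.0],
    [0.0, 0.0, 0.4, 0.6],
]
new_samples = crossover_oversample(minority, n_majority=8)
print(len(new_samples))  # 5
```

Unlike random oversampling, which duplicates existing minority samples, each synthetic vector here mixes features from two parents. Unlike SMOTE, every feature value is copied from a parent instead of being interpolated between neighbours.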

