Leveraging Synthetic Data and PU Learning For Phishing Email Detection
Fatima Zahra Qachfar, Rakesh M. Verma, Arjun Mukherjee
Proceedings of the Twelfth ACM Conference on Data and Application Security and Privacy, published 2022-04-14
DOI: https://doi.org/10.1145/3508398.3511524
Citations: 7
Abstract
Imbalanced data classification has long been one of the most challenging problems in data science, especially in cybersecurity, where security datasets show a skewed proportion of benign to phishing examples. Although the literature offers many phishing detection methods, most of them neglect the imbalanced nature of phishing email datasets. In this paper, we examine the effect of imbalance by varying the ratio of legitimate to phishing emails. We generate new synthetic instances using a generative adversarial network model for long sentences (LeakGAN) to balance the training set and reduce the impact of imbalance on classification. These synthetic instances are labeled by positive-unlabeled (PU) learning and added to the initial imbalanced training set. The resulting dataset is given to a Bidirectional Encoder Representations from Transformers (BERT) model for sequence classification. We compare several state-of-the-art methods from the literature against our approach, which achieves high performance across all imbalance ratios, reaching an F1-score of 99.6% for the most extreme imbalance ratio and an F1-score of 99.8% for balanced cases.
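The data-handling steps in the abstract (varying the class ratio, topping up the minority class with synthetic instances, and scoring with F1) can be illustrated with a minimal sketch. It is not the authors' implementation: LeakGAN, the PU learner, and BERT are not reproduced here. The `synthetic` list stands in for generated text, `accept` is a hypothetical placeholder for whatever PU-learning labeler decides that a synthetic instance counts as phishing, and the ratio of 10 is an illustrative value, since the abstract does not state which ratios were used.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, int]  # (email text, label): 1 = phishing, 0 = legitimate


def make_imbalanced(legit: List[str], phish: List[str],
                    ratio: int, seed: int = 0) -> List[Example]:
    """Keep every legitimate email; subsample phishing to about 1 per `ratio` legitimate."""
    rng = random.Random(seed)
    n_phish = min(len(phish), max(1, len(legit) // ratio))
    sampled = rng.sample(phish, n_phish)
    return [(t, 0) for t in legit] + [(t, 1) for t in sampled]


def rebalance(train: List[Example], synthetic: List[str],
              accept: Callable[[str], bool]) -> List[Example]:
    """Add accepted synthetic texts as phishing until the classes match or candidates run out.

    `accept` is an assumed stand-in for a PU-learning labeler.
    """
    n_legit = sum(1 for _, y in train if y == 0)
    n_phish = sum(1 for _, y in train if y == 1)
    needed = max(0, n_legit - n_phish)
    accepted = [t for t in synthetic if accept(t)][:needed]
    return train + [(t, 1) for t in accepted]


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the phishing class (label 1): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data, purely illustrative.
legit = [f"meeting notes {i}" for i in range(100)]
phish = [f"verify your account {i}" for i in range(50)]
synthetic = [f"verify your password {i}" for i in range(200)]

train = make_imbalanced(legit, phish, ratio=10)
balanced = rebalance(train, synthetic, accept=lambda text: "verify" in text)
```

With the toy data, `train` holds 100 legitimate and 10 phishing emails, and `balanced` adds 90 accepted synthetic instances so that both classes have 100 examples. In a real pipeline the balanced set would then be passed to a sequence classifier, and `f1_score` would be computed on held-out real emails rather than on synthetic ones.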