网络钓鱼邮件检测中自然语言处理和机器学习方法的比较

Proceedings of the 16th International Conference on Availability, Reliability and Security Pub Date : 2021-08-16 DOI:10.1145/3465481.3469205

Panagiotis Bountakas, Konstantinos Koutroumpouchos, C. Xenakis

{"title":"网络钓鱼邮件检测中自然语言处理和机器学习方法的比较","authors":"Panagiotis Bountakas, Konstantinos Koutroumpouchos, C. Xenakis","doi":"10.1145/3465481.3469205","DOIUrl":null,"url":null,"abstract":"Phishing is the most-used malicious attempt in which attackers, commonly via emails, impersonate trusted persons or entities to obtain private information from a victim. Even though phishing email attacks are a known cybercriminal strategy for decades, their usage has been expanded over last couple of years due to the COVID-19 pandemic, where attackers exploit people’s consternation to lure victims. Therefore, further research is needed in the phishing email detection field. Recent phishing email detection solutions that extract representational text-based features from the email’s body have proved to be an appropriate strategy to tackle these threats. This paper proposes a comparison approach for the combined usage of Natural Language Processing (TF-IDF, Word2Vec, and BERT) and Machine Learning (Random Forest, Decision Tree, Logistic Regression, Gradient Boosting Trees, and Naive Bayes) methods for phishing email detection. The evaluation was performed on two datasets, one balanced and one imbalanced, both of which were comprised of emails from the well-known Enron corpus and the most recent emails from the Nazario phishing corpus. The best combination in the balanced dataset proved to be the Word2Vec with the Random Forest algorithm, while in the imbalanced dataset the Word2Vec with the Logistic Regression algorithm.","PeriodicalId":417395,"journal":{"name":"Proceedings of the 16th International Conference on Availability, Reliability and Security","volume":"102 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"12","resultStr":"{\"title\":\"A Comparison of Natural Language Processing and Machine Learning Methods for Phishing Email Detection\",\"authors\":\"Panagiotis Bountakas, Konstantinos Koutroumpouchos, C. Xenakis\",\"doi\":\"10.1145/3465481.3469205\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Phishing is the most-used malicious attempt in which attackers, commonly via emails, impersonate trusted persons or entities to obtain private information from a victim. Even though phishing email attacks are a known cybercriminal strategy for decades, their usage has been expanded over last couple of years due to the COVID-19 pandemic, where attackers exploit people’s consternation to lure victims. Therefore, further research is needed in the phishing email detection field. Recent phishing email detection solutions that extract representational text-based features from the email’s body have proved to be an appropriate strategy to tackle these threats. This paper proposes a comparison approach for the combined usage of Natural Language Processing (TF-IDF, Word2Vec, and BERT) and Machine Learning (Random Forest, Decision Tree, Logistic Regression, Gradient Boosting Trees, and Naive Bayes) methods for phishing email detection. The evaluation was performed on two datasets, one balanced and one imbalanced, both of which were comprised of emails from the well-known Enron corpus and the most recent emails from the Nazario phishing corpus. The best combination in the balanced dataset proved to be the Word2Vec with the Random Forest algorithm, while in the imbalanced dataset the Word2Vec with the Logistic Regression algorithm.\",\"PeriodicalId\":417395,\"journal\":{\"name\":\"Proceedings of the 16th International Conference on Availability, Reliability and Security\",\"volume\":\"102 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-08-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"12\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 16th International Conference on Availability, Reliability and Security\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3465481.3469205\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 16th International Conference on Availability, Reliability and Security","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3465481.3469205","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 12

摘要

网络钓鱼是最常用的恶意攻击，攻击者通常通过电子邮件冒充受信任的人或实体，从受害者那里获取私人信息。尽管网络钓鱼电子邮件攻击几十年来一直是一种已知的网络犯罪策略，但由于COVID-19大流行，它们的使用范围在过去几年中有所扩大，攻击者利用人们的恐慌来引诱受害者。因此，在网络钓鱼邮件检测领域还需要进一步的研究。最近的网络钓鱼电子邮件检测解决方案从电子邮件正文中提取具有代表性的基于文本的特征，已被证明是解决这些威胁的适当策略。本文提出了一种比较自然语言处理(TF-IDF, Word2Vec和BERT)和机器学习(随机森林，决策树，逻辑回归，梯度增强树和朴素贝叶斯)方法在网络钓鱼电子邮件检测中的组合使用方法。评估是在两个数据集上进行的，一个是平衡的，一个是不平衡的，这两个数据集都由来自著名的安然语料库的电子邮件和来自Nazario网络钓鱼语料库的最新电子邮件组成。平衡数据集的最佳组合是Word2Vec与随机森林算法的结合，而不平衡数据集的最佳组合是Word2Vec与Logistic回归算法的结合。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

A Comparison of Natural Language Processing and Machine Learning Methods for Phishing Email Detection

Phishing is the most-used malicious attempt in which attackers, commonly via emails, impersonate trusted persons or entities to obtain private information from a victim. Even though phishing email attacks are a known cybercriminal strategy for decades, their usage has been expanded over last couple of years due to the COVID-19 pandemic, where attackers exploit people’s consternation to lure victims. Therefore, further research is needed in the phishing email detection field. Recent phishing email detection solutions that extract representational text-based features from the email’s body have proved to be an appropriate strategy to tackle these threats. This paper proposes a comparison approach for the combined usage of Natural Language Processing (TF-IDF, Word2Vec, and BERT) and Machine Learning (Random Forest, Decision Tree, Logistic Regression, Gradient Boosting Trees, and Naive Bayes) methods for phishing email detection. The evaluation was performed on two datasets, one balanced and one imbalanced, both of which were comprised of emails from the well-known Enron corpus and the most recent emails from the Nazario phishing corpus. The best combination in the balanced dataset proved to be the Word2Vec with the Random Forest algorithm, while in the imbalanced dataset the Word2Vec with the Logistic Regression algorithm.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 16th International Conference on Availability, Reliability and Security

自引率

0.00%

发文量