利用机器学习模型结合词法、主机和内容特征检测钓鱼网站

Samiya Hamadouche, Ouadjih Boudraa, Mohamed Gasmi
{"title":"利用机器学习模型结合词法、主机和内容特征检测钓鱼网站","authors":"Samiya Hamadouche, Ouadjih Boudraa, Mohamed Gasmi","doi":"10.4108/eetsis.4421","DOIUrl":null,"url":null,"abstract":"In cybersecurity field, identifying and dealing with threats from malicious websites (phishing, spam, and drive-by downloads, for example) is a major concern for the community. Consequently, the need for effective detection methods has become a necessity. Recent advances in Machine Learning (ML) have renewed interest in its application to a variety of cybersecurity challenges. When it comes to detecting phishing URLs, machine learning relies on specific attributes, such as lexical, host, and content based features. The main objective of our work is to propose, implement and evaluate a solution for identifying phishing URLs based on a combination of these feature sets. This paper focuses on using a new balanced dataset, extracting useful features from it, and selecting the optimal features using different feature selection techniques to build and conduct acomparative performance evaluation of four ML models (SVM, Decision Tree, Random Forest, and XGBoost). Results showed that the XGBoost model outperformed the others models, with an accuracy of 95.70% and a false negatives rate of 1.94%.","PeriodicalId":155438,"journal":{"name":"ICST Transactions on Scalable Information Systems","volume":" 10","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Combining Lexical, Host, and Content-based features for Phishing Websites detection using Machine Learning Models\",\"authors\":\"Samiya Hamadouche, Ouadjih Boudraa, Mohamed Gasmi\",\"doi\":\"10.4108/eetsis.4421\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In cybersecurity field, identifying and dealing with threats from malicious websites (phishing, spam, and drive-by downloads, for example) is a major concern for the community. Consequently, the need for effective detection methods has become a necessity. Recent advances in Machine Learning (ML) have renewed interest in its application to a variety of cybersecurity challenges. When it comes to detecting phishing URLs, machine learning relies on specific attributes, such as lexical, host, and content based features. The main objective of our work is to propose, implement and evaluate a solution for identifying phishing URLs based on a combination of these feature sets. This paper focuses on using a new balanced dataset, extracting useful features from it, and selecting the optimal features using different feature selection techniques to build and conduct acomparative performance evaluation of four ML models (SVM, Decision Tree, Random Forest, and XGBoost). Results showed that the XGBoost model outperformed the others models, with an accuracy of 95.70% and a false negatives rate of 1.94%.\",\"PeriodicalId\":155438,\"journal\":{\"name\":\"ICST Transactions on Scalable Information Systems\",\"volume\":\" 10\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-04-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ICST Transactions on Scalable Information Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.4108/eetsis.4421\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ICST Transactions on Scalable Information Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.4108/eetsis.4421","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

摘要

在网络安全领域,识别和处理来自恶意网站的威胁(如网络钓鱼、垃圾邮件和偷渡式下载)是社会关注的一个主要问题。因此,需要有效的检测方法。机器学习(ML)的最新进展再次激发了人们将其应用于各种网络安全挑战的兴趣。在检测网络钓鱼 URL 时,机器学习依赖于特定的属性,如基于词法、主机和内容的特征。我们工作的主要目标是提出、实施和评估一种基于这些特征集组合的网络钓鱼 URL 识别解决方案。本文的重点是使用一个新的平衡数据集,从中提取有用的特征,并使用不同的特征选择技术来选择最佳特征,从而建立四个 ML 模型(SVM、决策树、随机森林和 XGBoost)并进行性能比较评估。结果显示,XGBoost 模型的准确率为 95.70%,误判率为 1.94%,优于其他模型。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Combining Lexical, Host, and Content-based features for Phishing Websites detection using Machine Learning Models
In cybersecurity field, identifying and dealing with threats from malicious websites (phishing, spam, and drive-by downloads, for example) is a major concern for the community. Consequently, the need for effective detection methods has become a necessity. Recent advances in Machine Learning (ML) have renewed interest in its application to a variety of cybersecurity challenges. When it comes to detecting phishing URLs, machine learning relies on specific attributes, such as lexical, host, and content based features. The main objective of our work is to propose, implement and evaluate a solution for identifying phishing URLs based on a combination of these feature sets. This paper focuses on using a new balanced dataset, extracting useful features from it, and selecting the optimal features using different feature selection techniques to build and conduct acomparative performance evaluation of four ML models (SVM, Decision Tree, Random Forest, and XGBoost). Results showed that the XGBoost model outperformed the others models, with an accuracy of 95.70% and a false negatives rate of 1.94%.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信