An explainable feature selection framework for web phishing detection with machine learning

Sakib Shahriar Shafin
{"title":"An explainable feature selection framework for web phishing detection with machine learning","authors":"Sakib Shahriar Shafin","doi":"10.1016/j.dsm.2024.08.004","DOIUrl":null,"url":null,"abstract":"<div><div>In the evolving landscape of cyber threats, phishing attacks pose significant challenges, particularly through deceptive webpages designed to extract sensitive information under the guise of legitimacy. Conventional and machine learning (ML)-based detection systems struggle to detect phishing websites owing to their constantly changing tactics. Furthermore, newer phishing websites exhibit subtle and expertly concealed indicators that are not readily detectable. Hence, effective detection depends on identifying the most critical features. Traditional feature selection (FS) methods often struggle to enhance ML model performance and instead decrease it. To combat these issues, we propose an innovative method using explainable AI (XAI) to enhance FS in ML models and improve the identification of phishing websites. Specifically, we employ SHapley Additive exPlanations (SHAP) for global perspective and aggregated local interpretable model-agnostic explanations (LIME) to determine specific localized patterns. The proposed SHAP and LIME-aggregated FS (SLA-FS) framework pinpoints the most informative features, enabling more precise, swift, and adaptable phishing detection. Applying this approach to an up-to-date web phishing dataset, we evaluate the performance of three ML models before and after FS to assess their effectiveness. Our findings reveal that random forest (RF), with an accuracy of 97.41% and XGBoost (XGB) at 97.21% significantly benefit from the SLA-FS framework, while k-nearest neighbors lags. Our framework increases the accuracy of RF and XGB by 0.65% and 0.41%, respectively, outperforming traditional filter or wrapper methods and any prior methods evaluated on this dataset, showcasing its potential.</div></div>","PeriodicalId":100353,"journal":{"name":"Data Science and Management","volume":"8 2","pages":"Pages 127-136"},"PeriodicalIF":0.0000,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Data Science and Management","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2666764924000419","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

In the evolving landscape of cyber threats, phishing attacks pose significant challenges, particularly through deceptive webpages designed to extract sensitive information under the guise of legitimacy. Conventional and machine learning (ML)-based detection systems struggle to detect phishing websites owing to their constantly changing tactics. Furthermore, newer phishing websites exhibit subtle and expertly concealed indicators that are not readily detectable. Hence, effective detection depends on identifying the most critical features. Traditional feature selection (FS) methods often fail to enhance ML model performance and can instead degrade it. To combat these issues, we propose an innovative method using explainable AI (XAI) to enhance FS in ML models and improve the identification of phishing websites. Specifically, we employ SHapley Additive exPlanations (SHAP) for a global perspective and aggregated local interpretable model-agnostic explanations (LIME) to determine specific localized patterns. The proposed SHAP and LIME-aggregated FS (SLA-FS) framework pinpoints the most informative features, enabling more precise, swift, and adaptable phishing detection. Applying this approach to an up-to-date web phishing dataset, we evaluate the performance of three ML models before and after FS to assess their effectiveness. Our findings reveal that random forest (RF), with an accuracy of 97.41%, and XGBoost (XGB), at 97.21%, significantly benefit from the SLA-FS framework, while k-nearest neighbors lags. Our framework increases the accuracy of RF and XGB by 0.65% and 0.41%, respectively, outperforming traditional filter or wrapper methods and any prior methods evaluated on this dataset, showcasing its potential.
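The abstract describes SLA-FS only at a high level. As a rough illustration of the idea, the sketch below combines global mean-|SHAP| importances with |weight| LIME importances aggregated over a sample of instances, then ranks features by an equally weighted average of the two normalized scores. The equal weighting, the LIME sample size, and the top-k cutoff are assumptions for illustration, not the paper's reported configuration.

```python
# A minimal sketch of the SLA-FS idea, assuming a tree-based classifier and
# tabular features. The 50/50 weighting of the global (SHAP) and aggregated
# local (LIME) scores is an illustrative assumption.
import numpy as np
import shap
from lime.lime_tabular import LimeTabularExplainer
from sklearn.ensemble import RandomForestClassifier

def sla_fs_ranking(model, X_train, feature_names, n_lime_samples=100):
    # Global view: mean |SHAP value| per feature over the training set.
    sv = shap.TreeExplainer(model).shap_values(X_train)
    if isinstance(sv, list):        # older SHAP versions: one array per class
        sv = sv[1]
    elif sv.ndim == 3:              # newer SHAP versions: (samples, features, classes)
        sv = sv[:, :, 1]
    global_imp = np.abs(sv).mean(axis=0)

    # Local view: aggregate |LIME weight| per feature over sampled instances.
    lime_exp = LimeTabularExplainer(X_train, feature_names=feature_names,
                                    mode="classification")
    local_imp = np.zeros(X_train.shape[1])
    rng = np.random.default_rng(0)
    n = min(n_lime_samples, len(X_train))
    for i in rng.choice(len(X_train), size=n, replace=False):
        exp = lime_exp.explain_instance(X_train[i], model.predict_proba,
                                        num_features=X_train.shape[1])
        for feat_idx, weight in exp.as_map()[1]:
            local_imp[feat_idx] += abs(weight)
    local_imp /= n

    # Normalize each view to [0, 1] and average them (assumed equal weighting).
    def norm(v):
        return v / v.max() if v.max() > 0 else v
    return 0.5 * (norm(global_imp) + norm(local_imp))

# Example: keep the top-k features by the combined score (k is illustrative).
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)
scores = sla_fs_ranking(rf, X_train, feature_names)
top_k = np.argsort(scores)[::-1][:20]
rf_fs = RandomForestClassifier(n_estimators=300, random_state=0).fit(
    X_train[:, top_k], y_train)
```

Averaging the two normalized scores reflects the framework's premise that a feature should rank highly only if it matters both globally (SHAP) and across many individual predictions (aggregated LIME); how the paper actually fuses the two views may differ.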