Phishing email detection using vector similarity search leveraging transformer-based word embedding

IF 4.0; Zone 3 (Computer Science); JCR Q1 (Computer Science, Hardware & Architecture)
Chanchal Patra, Debasis Giri, Sutanu Nandi, Ashok Kumar Das, Mohammed J.F. Alenazi
Computers & Electrical Engineering, vol. 124, Article 110403. Journal article, published 2025-05-01.
DOI: 10.1016/j.compeleceng.2025.110403
URL: https://www.sciencedirect.com/science/article/pii/S0045790625003465
Citations: 0

Abstract

As cybercrime increases, using email cautiously is crucial. Phishing emails are a major threat, often exploited to steal sensitive data and cause financial losses. Although anti-phishing techniques exist, constantly evolving phishing tactics make them hard to counter. This study proposes a phishing detection system that combines transformer-based word embedding with vector similarity search. Pre-trained models such as Dense Passage Retrieval (DPR) create high-dimensional vector embeddings from emails, which are stored in a vector database for real-time similarity searches. The proposed approach outperforms traditional machine learning by automating feature extraction and improving similarity search efficiency, making it more effective at detecting phishing emails. Empirical evaluation was conducted on three publicly available datasets: Enron, the Nazario phishing corpora, and the Phishing validation emails dataset. Using cosine similarity, the system achieves 98.43% accuracy, 98.44% precision, 98.38% recall, a 98.41% F1-score, and an area under the curve (AUC) of 0.984.
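The pipeline described in the abstract (embed each email, store the vectors, then search by cosine similarity) can be illustrated with a small, self-contained sketch. Everything in it is an assumption made for illustration and not the authors' implementation: `toy_embed` is a bag-of-words stand-in for a transformer encoder such as DPR, `ToyVectorStore` is a brute-force in-memory substitute for a vector database, and the majority vote over the `k` nearest neighbours is one plausible decision rule that the abstract does not specify. The sample emails are invented.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for a transformer encoder such as DPR.

    Returns a sparse term-count vector keyed by lowercase token.
    A real system would return a dense high-dimensional vector instead.
    """
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(token, 0) for token, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Hypothetical in-memory stand-in for a vector database (brute-force search)."""

    def __init__(self, embed):
        self.embed = embed
        self.entries = []  # list of (vector, label) pairs

    def add(self, text, label):
        self.entries.append((self.embed(text), label))

    def search(self, text, k=3):
        """Return the k stored entries most similar to the query, best first."""
        query = self.embed(text)
        scored = [(cosine_similarity(query, vec), label) for vec, label in self.entries]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


def classify(store, text, k=3):
    """Assumed decision rule: majority vote among the k nearest stored emails."""
    votes = Counter(label for _, label in store.search(text, k))
    return votes.most_common(1)[0][0]


# Invented example emails, for illustration only.
store = ToyVectorStore(toy_embed)
store.add("verify your account password now", "phishing")
store.add("urgent click this link to verify your bank account", "phishing")
store.add("your account is suspended confirm password", "phishing")
store.add("meeting agenda for monday attached", "legitimate")
store.add("please review the quarterly report draft", "legitimate")
store.add("lunch on friday with the project team", "legitimate")

print(classify(store, "please verify your bank account password"))  # phishing
print(classify(store, "draft agenda for the project meeting"))      # legitimate
```

Swapping `toy_embed` for a real encoder and `ToyVectorStore` for an actual vector database would leave the `classify` logic unchanged, which is the main appeal of the design: feature extraction is delegated entirely to the embedding model.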
Source journal
Computers & Electrical Engineering (Engineering Technology: Engineering, Electrical & Electronic)
CiteScore: 9.20
Self-citation rate: 7.00%
Articles published: 661
Review time: 47 days
Journal description: The impact of computers has nowhere been more revolutionary than in electrical engineering. The design, analysis, and operation of electrical and electronic systems are now dominated by computers, a transformation that has been motivated by the natural ease of interface between computers and electrical systems, and the promise of spectacular improvements in speed and efficiency. Published since 1973, Computers & Electrical Engineering provides rapid publication of topical research into the integration of computer technology and computational techniques with electrical and electronic systems. The journal publishes papers featuring novel implementations of computers and computational techniques in areas like signal and image processing, high-performance computing, parallel processing, and communications. Special attention will be paid to papers describing innovative architectures, algorithms, and software tools.