Chanchal Patra, Debasis Giri, Sutanu Nandi, Ashok Kumar Das, Mohammed J.F. Alenazi
Title: Phishing email detection using vector similarity search leveraging transformer-based word embedding
Journal: Computers & Electrical Engineering, vol. 124, Article 110403
Publication date: 2025-05-01
DOI: 10.1016/j.compeleceng.2025.110403
URL: https://www.sciencedirect.com/science/article/pii/S0045790625003465
Abstract
As cybercrime increases, using email cautiously is crucial. Phishing emails are a major threat, often exploited to steal sensitive data and cause financial losses. Anti-phishing techniques exist, but evolving phishing tactics make them hard to keep effective. This study proposes a phishing detection system based on transformer-based word embedding and vector similarity search. Pre-trained models such as Dense Passage Retrieval (DPR) create high-dimensional vector embeddings from emails, which are stored in a vector database for real-time similarity searches. The proposed approach outperforms traditional machine learning by automating feature extraction and improving similarity search efficiency, making it more effective at detecting phishing emails. Empirical evaluation was conducted on three publicly available datasets: Enron, the Nazario phishing corpora, and the Phishing validation emails dataset. Using cosine similarity, the system achieves 98.43% accuracy, 98.44% precision, 98.38% recall, 98.41% F1-score, and an area under the curve (AUC) of 0.984.
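The core idea of classifying an email by cosine-similarity search over stored embeddings can be illustrated with a minimal sketch. This is not the authors' implementation: the three-dimensional vectors below are hypothetical stand-ins for embeddings that a model such as DPR would produce, the in-memory list stands in for a vector database, and the majority vote over the `k` nearest neighbours is an assumed decision rule that the abstract does not specify.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(query, index, k=3):
    """Label a query embedding by majority vote among its k most similar
    stored embeddings (assumed decision rule, for illustration only)."""
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query, item[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy "vector database": (embedding, label) pairs.
# Real embeddings would have hundreds of dimensions.
INDEX = [
    ([0.9, 0.1, 0.0], "phishing"),
    ([0.8, 0.2, 0.1], "phishing"),
    ([0.7, 0.3, 0.0], "phishing"),
    ([0.1, 0.9, 0.2], "legitimate"),
    ([0.0, 0.8, 0.3], "legitimate"),
    ([0.2, 0.7, 0.1], "legitimate"),
]

suspicious = classify([0.85, 0.15, 0.05], INDEX)
benign = classify([0.05, 0.90, 0.20], INDEX)
print(suspicious, benign)
```

A production system would replace the linear scan with an approximate nearest-neighbour index so that lookups stay fast as the number of stored emails grows.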
About the journal:
The impact of computers has nowhere been more revolutionary than in electrical engineering. The design, analysis, and operation of electrical and electronic systems are now dominated by computers, a transformation that has been motivated by the natural ease of interface between computers and electrical systems, and the promise of spectacular improvements in speed and efficiency.
Published since 1973, Computers & Electrical Engineering provides rapid publication of topical research into the integration of computer technology and computational techniques with electrical and electronic systems. The journal publishes papers featuring novel implementations of computers and computational techniques in areas like signal and image processing, high-performance computing, parallel processing, and communications. Special attention will be paid to papers describing innovative architectures, algorithms, and software tools.