Phishing Message Detection Based on Keyword Matching

Q2 Social Sciences

Australian Journal of Telecommunications and the Digital Economy | Pub Date: 2023-09-30 | DOI: 10.18080/jtde.v11n3.776

Keng-Theen Tham, Kok-Why Ng, Su-Cheng Haw

Citations: 0

Abstract

Phishing Message Detection Based on Keyword Matching

This paper proposes using a Naïve Bayes-based algorithm for phishing detection, specifically in spam emails. The paper compares probability-based and frequency-based approaches and investigates the impact of imbalanced datasets and of stemming as a natural language processing (NLP) technique. Results show that both algorithms perform similarly in spam detection, with the choice between them depending on factors such as efficiency and scalability. Accuracy is influenced by the dataset configuration and by stemming. With imbalanced datasets, the algorithms detect majority-class emails more accurately but struggle to classify minority-class emails. In contrast, balanced datasets yield high overall accuracy for both spam and ham email identification. The study finds that stemming has a minor impact on algorithm performance, occasionally decreasing accuracy due to word grouping. Balancing the dataset is crucial for improving algorithm performance and achieving accurate spam email detection. Hence, both probability-based and frequency-based Naïve Bayes algorithms are effective for phishing detection when used with balanced datasets. The frequency-based approach, with a balanced dataset and stemming, achieves a balance between recall and precision, while the probability-based method, with a balanced dataset and no stemming, prioritises overall accuracy.
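
To make the ingredients named in the abstract concrete (word-count-based Naïve Bayes, dataset balancing, and optional stemming), the following is a minimal sketch only. It is not the authors' implementation: the abstract does not specify how the probability-based and frequency-based variants are defined, so this uses a textbook multinomial Naïve Bayes with Laplace smoothing. The function names, the suffix-stripping `crude_stem` helper (a stand-in for a real stemmer such as Porter's), random undersampling as the balancing method, and the toy messages are all assumptions made for illustration.

```python
import math
import random
from collections import Counter

def crude_stem(word):
    # Hypothetical stand-in for a real stemmer: strips a few common suffixes.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text, stem=False):
    # Lowercase, split on whitespace, and trim surrounding punctuation.
    tokens = [t.strip(".,!?:;\"'()").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    return [crude_stem(t) for t in tokens] if stem else tokens

def balance(samples, seed=0):
    # Assumed balancing method: undersample every class to the minority size.
    rng = random.Random(seed)
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, []).append((text, label))
    n = min(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(rng.sample(items, n))
    return balanced

def train(samples, stem=False):
    # Count word occurrences per class and documents per class.
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for text, label in samples:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text, stem))
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocab

def classify(text, model, stem=False):
    # Log prior plus Laplace-smoothed log likelihood of each token.
    # Assumes both classes appear at least once in the training data.
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for tok in tokenize(text, stem):
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)

# Toy, made-up messages (not data from the paper): 2 spam, 4 ham.
data = [
    ("Verify your account now to claim your prize", "spam"),
    ("Urgent: confirm your password to avoid suspension", "spam"),
    ("Meeting moved to Monday morning", "ham"),
    ("Please review the attached project report", "ham"),
    ("Lunch on Friday?", "ham"),
    ("Notes from the project meeting", "ham"),
]
balanced = balance(data)
model = train(balanced, stem=True)
print(classify("Confirm your account password", model, stem=True))
print(classify("project meeting report", model, stem=True))
```

Training on `data` directly instead of `balanced` gives the ham class a larger prior, which is one simple way to observe the majority-class bias the abstract describes. Toggling `stem` in both `train` and `classify` shows how grouping word forms changes the counts.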

Source journal

Australian Journal of Telecommunications and the Digital Economy Social Sciences-Communication

CiteScore

1.60

Self-citation rate

0.00%

Number of publications

About the journal: The Journal of Telecommunications and the Digital Economy (JTDE) is an international, open-access, high-quality, peer-reviewed journal, indexed by Scopus and Google Scholar, covering innovative research and practice in Telecommunications, Digital Economy and Applications. The mission of JTDE is to further through publication the objective of advancing learning, knowledge and research worldwide. The JTDE publishes peer-reviewed papers that may take the following forms:
* Research Paper - a paper making an original contribution to engineering knowledge.
* Special Interest Paper - a report on significant aspects of a major or notable project.
* Review Paper for specialists - an overview of a relevant area intended for specialists in the field covered.
* Review Paper for non-specialists - an overview of a relevant area suitable for a reader with an electrical/electronics background.
* Public Policy Discussion - a paper that identifies or discusses public policy and includes investigation of legislation, regulation and what is happening around the world, including best practice.
* Tutorial Paper - a paper that explains an important subject or clarifies the approach to an area of design or investigation.
* Technical Note - a technical note or letter to the Editors that is not sufficiently developed or extensive in scope to constitute a full paper.
* Industry Case Study - a paper that provides details of industry practices, utilising a case study to provide an understanding of what is occurring and how the outcomes have been achieved.
* Discussion - a contribution to discuss a published paper, to which the original author's response will be sought.
* Historical - a paper covering a historical topic related to telecommunications or the digital economy.