Deteksi Email Spam dengan Continuous Bag-Of-Words dan Random Forest (Spam Email Detection with Continuous Bag-of-Words and Random Forest)

Michiavelly Rustam, Agung Brotokuncoro, Rusdianto Roestam
Journal: Ranah Research : Journal of Multidisciplinary Research and Development
Published: 2024-06-05
DOI: 10.38035/rrj.v6i4.873 (https://doi.org/10.38035/rrj.v6i4.873)
Citations: 0

Abstract

Spam email is a significant cyber threat: scammers use a range of tactics to trick recipients into disclosing sensitive information or downloading harmful content. In June 2023, for instance, Indonesia recorded approximately 6.51 thousand spam attacks, which illustrates how widespread the problem is. These attacks often rely on deception, such as impersonation or false promises of rewards, and falling for them can lead to financial loss and other serious consequences. This research addresses the problem by classifying email content to detect phishing attempts. The proposed solution runs on a hosted runtime platform such as Google Colab and combines Continuous Bag of Words (CBOW) analysis with the Random Forest method. CBOW is selected because it captures semantic relationships between words, which lets the model extract meaningful features from email content. Random Forest is chosen for its ability to handle the imbalanced datasets common in email classification, so that both spam and ham emails are fairly represented during model training. By combining the two techniques, we aim to build a robust classifier that accurately distinguishes phishing (spam) from legitimate (ham) emails and thereby strengthens email security. We apply the approach to the SpamAssassin dataset, classifying each email as ham or spam, with an anticipated precision of 0.98.
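The abstract names the building blocks but not their details, so the following is only a minimal sketch of the ideas involved, not the authors' implementation. It shows how CBOW frames its training task (predict a word from its surrounding context), one common way to turn word vectors into a fixed-length email feature (averaging, which is an assumption here), and how precision is computed. The function names, the window size, and the toy two-dimensional embeddings are all hypothetical.

```python
def cbow_pairs(tokens, window=2):
    """Build CBOW training examples: (context words, target word).

    For each position, the context is up to `window` words on each side.
    """
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:  # skip single-word inputs that have no context
            pairs.append((context, target))
    return pairs


def mean_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens into one email-level feature.

    Assumption: averaging is used to pool word vectors; the abstract does
    not say how word-level features are combined.
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[k] for vec in known) / len(known) for k in range(dim)]


def precision(y_true, y_pred, positive="spam"):
    """Precision = TP / (TP + FP) for the positive (spam) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Toy data, invented purely for illustration.
email = ["win", "a", "free", "prize"]
toy_embeddings = {"free": [1.0, 0.0], "prize": [0.0, 1.0]}

print(cbow_pairs(email, window=1))
print(mean_vector(email, toy_embeddings, dim=2))
print(precision(["spam", "ham", "spam", "ham"], ["spam", "spam", "spam", "ham"]))
```

In a full pipeline, the embeddings would come from a trained CBOW model rather than a hand-written dictionary, and the pooled vectors would be passed to a Random Forest classifier. Precision measures what fraction of emails flagged as spam are actually spam, so a value of 0.98 would mean roughly 2 in 100 flagged emails are legitimate.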