Malicious Webpage Detection Based on Feature Fusion Using Natural Language Processing and Machine Learning

P. G, Devi R
DOI: 10.1109/ICECAA58104.2023.10212120
URL: https://doi.org/10.1109/ICECAA58104.2023.10212120
Published in: 2023 2nd International Conference on Edge Computing and Applications (ICECAA)
Publication date: 2023-07-19
Citations: 0

Abstract

Malicious websites are purposefully designed to deceive internet users in order to steal sensitive personal information, infect the victim's system with malware, cause financial losses, and damage the victim's reputation. Internet users find it hard to recognize such pages or links, so detection tools are used to discover them. Most detection techniques rely on blacklisting or whitelisting to find and block malicious websites. However, compiling such a large list of website links is time-consuming, and keeping it up to date is difficult. Researchers therefore employ machine learning-based methods to identify these fraudulent links. These methods rely on features extracted from URLs or web pages; additional features such as DNS details, webpage reputation, and visual similarity data are also used. However, these features are few and do not fully exploit the URLs or the website contents. This work focuses on merging URL lexical features with content-based features for malicious webpage detection in order to exploit the dataset's full potential. Natural language processing methods such as the Hashing, Count, and Term Frequency-Inverse Document Frequency (TF-IDF) vectorizers are employed to extract features from the content of web pages. The effectiveness of the proposed approach is evaluated using well-known machine learning methods. The results show that the Count vectorizer with Random Forest achieves the highest accuracy, 91.17%, with 500 features.
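
The abstract does not say how the features are computed or fused, so the following is only a minimal sketch of the general idea, written in plain Python with no external libraries. Everything specific in it is an assumption made for illustration: the five URL lexical features, the tokenizer, the use of CRC32 as the hash function, the smoothed TF-IDF variant (without length normalization), the sample pages, and all function names. It shows one way to turn page text into Count, TF-IDF, and Hashing vectors and to concatenate a content vector with URL lexical features. The classifier step (for example Random Forest) is left out.

```python
import math
import re
import zlib
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def url_lexical_features(url):
    """Illustrative URL lexical features (a hypothetical selection)."""
    return [
        float(len(url)),                        # total URL length
        float(url.count(".")),                  # number of dots
        float(sum(c.isdigit() for c in url)),   # number of digits
        float(url.count("-")),                  # number of hyphens
        1.0 if url.startswith("https://") else 0.0,
    ]


def build_vocabulary(docs, max_features):
    """Keep the max_features most frequent tokens across the corpus."""
    totals = Counter(tok for doc in docs for tok in tokenize(doc))
    # Descending count, then alphabetical, so the order is deterministic.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tok for tok, _ in ranked[:max_features]]


def count_vector(doc, vocab):
    """Count vectorizer: raw term counts over a fixed vocabulary."""
    counts = Counter(tokenize(doc))
    return [float(counts[tok]) for tok in vocab]


def tfidf_vector(doc, vocab, docs):
    """TF-IDF with the smoothed idf = ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    token_sets = [set(tokenize(d)) for d in docs]
    counts = Counter(tokenize(doc))
    vec = []
    for tok in vocab:
        df = sum(tok in s for s in token_sets)  # document frequency
        idf = math.log((1 + n) / (1 + df)) + 1.0
        vec.append(counts[tok] * idf)
    return vec


def hashing_vector(doc, n_buckets):
    """Hashing vectorizer: no vocabulary, tokens are hashed into buckets."""
    vec = [0.0] * n_buckets
    for tok in tokenize(doc):
        # CRC32 is a stand-in hash chosen because it is stable across runs.
        vec[zlib.crc32(tok.encode("utf-8")) % n_buckets] += 1.0
    return vec


def fuse(url, content_vec):
    """Feature fusion by concatenation: URL features, then content features."""
    return url_lexical_features(url) + content_vec


# Made-up sample pages: (URL, visible page text).
pages = [
    ("http://free-prize-login.example/verify123",
     "verify your account now to claim your free prize"),
    ("https://docs.example/guide",
     "this guide explains how to install the package"),
]
docs = [text for _, text in pages]
vocab = build_vocabulary(docs, max_features=5)
fused = [fuse(url, count_vector(text, vocab)) for url, text in pages]
```

Each row of `fused` holds the five URL features followed by one count per vocabulary term, and could be passed to any classifier that accepts fixed-length numeric vectors. The abstract's best result uses 500 content features; `max_features=5` here only keeps the example small.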