Comparative Analysis of Feature Selection Techniques for Malicious Website Detection in SMOTE Balanced Data

RS Open Journal on Innovative Communication Technologies, Pub Date: 2021-04-10, DOI: 10.46470/03D8FFBD.993CF635

Naman Bhoj, Alpika Tripathi, Granth Singh Bisht, Adarsh Raj Dwivedi, Bishwajeet K. Pandey, Nitin Chhimwal


Citations: 4

Abstract


Comparative Analysis of Feature Selection Techniques for Malicious Website Detection in SMOTE Balanced Data

Advances in network technology have led to an exponential rise in the number of internet users across the globe. The increase in internet usage has resulted in an increase in both the number of malicious websites and the number of cybercrimes reported over the years. It has therefore become critical to devise an intelligent solution that can detect malicious websites and be used in real-time systems. In this paper, we perform a comparative analysis of various feature selection techniques to build a time-efficient and accurate predictive model. The model is built on the set of features chosen by each feature selection method. For every feature selection technique examined in this paper, at least 70% of the selected features are categorical. With real-time deployment as the end goal, this matters because such features are far cheaper to process and store than text- or image-based features. Our data initially had a class imbalance, which we addressed using the Synthetic Minority Oversampling Technique (SMOTE). Our proposed model also outperformed existing work in the literature across various evaluation metrics. The results indicate that embedded feature selection was the best technique in terms of model accuracy. The filter-based technique may also be used when developing a low-latency system, at the cost of some model accuracy.
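The abstract says the class imbalance was handled with SMOTE but gives no implementation details. As a rough illustration only, and not the authors' code, the core idea can be sketched in plain Python: each synthetic minority sample is an interpolation between a minority point and one of its k nearest minority neighbours. The function name `smote`, its parameters, and the toy data are all hypothetical, and the sketch assumes purely numeric features; categorical features like those the paper relies on would need encoding or a categorical-aware variant.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic samples interpolated between a minority
    point and one of its k nearest minority neighbours (illustrative only)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: 6 benign rows vs. 3 malicious rows, two numeric features.
benign = [[0.1, 1.0], [0.2, 0.9], [0.0, 1.1], [0.3, 1.2], [0.1, 0.8], [0.2, 1.0]]
malicious = [[2.0, 3.0], [2.2, 3.1], [1.9, 2.8]]

# Oversample the minority class until both classes have the same size.
new_rows = smote(malicious, n_new=len(benign) - len(malicious), k=2)
balanced_malicious = malicious + new_rows
```

In practice a maintained library implementation would normally be preferred over a hand-rolled one, and oversampling is usually applied only to the training split so that synthetic points do not leak into evaluation.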

