Using Machine Learning to Detect and Classify URLs: A Phishing Detection Approach

2023 4th International Conference on Electronics and Sustainable Communication Systems (ICESC). Pub Date: 2023-07-06. DOI: 10.1109/ICESC57686.2023.10193559

Mahesh, Ananth, Dheepthi


Citations: 1

Abstract

The growing number of cyber-attacks and fraudulent activities on the internet has made it necessary to identify malicious URLs in real time. This study proposes a machine learning method that classifies URLs into four categories: phishing, malware, benign, and defacement. The dataset used to train and test our models contains over 651,191 URLs with a variety of features, such as the length of the URL, the presence or absence of symbols, the length of the hostname, and the length of the path, among others. To find the machine learning algorithm and architecture that produces the best results for this classification task, we investigate a variety of options. In our experiments, a multi-layer perceptron (MLP) architecture performs significantly better than the other models, achieving an accuracy of 95.6 percent. To handle the large dataset, this study implements a parallel data processing pipeline that preprocesses URLs and extracts their features in parallel, which significantly reduces the time needed for training. The proposed method offers a practical answer to the problem of identifying potentially harmful URLs and is adaptable enough to be incorporated into existing infrastructure to improve the safety of internet users.
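The abstract names the kinds of lexical features used (URL length, symbol presence, hostname length, path length) and a parallel extraction pipeline, but gives no implementation details. The following is a minimal sketch of how such features might be computed and mapped over many URLs concurrently. It is not the authors' code: the function names `extract_features` and `extract_parallel`, the `SYMBOLS` set, and the worker count are all assumptions made for illustration.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit

# Hypothetical symbol set; the abstract does not say which symbols are checked.
SYMBOLS = "@-_?=&%."

# Matches a leading scheme such as "http://" or "ftp://".
_SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def extract_features(url: str) -> dict:
    """Return simple lexical features for a single URL."""
    # urlsplit only fills in the hostname when the URL starts with a scheme
    # or "//", so bare URLs like "example.com/path" get a "//" prefix.
    parseable = url if _SCHEME_RE.match(url) else "//" + url
    parts = urlsplit(parseable)
    hostname = parts.hostname or ""

    features = {
        "url_length": len(url),
        "hostname_length": len(hostname),
        "path_length": len(parts.path),
    }
    # Symbol counts; a count of zero means the symbol is absent.
    for ch in SYMBOLS:
        features[f"count_{ch}"] = url.count(ch)
    return features


def extract_parallel(urls, workers=4):
    """Extract features for many URLs concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_features, urls))
```

A thread pool keeps the sketch portable. Because feature extraction is CPU-bound, a process pool (`concurrent.futures.ProcessPoolExecutor`) would be the more usual choice for a dataset of this size. The resulting feature dictionaries could then be converted to a numeric matrix and passed to a classifier such as an MLP.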


