Using Machine Learning to Detect and Classify URLs: A Phishing Detection Approach
Authors: Mahesh, Ananth, Dheepthi
Published in: 2023 4th International Conference on Electronics and Sustainable Communication Systems (ICESC), 2023-07-06
DOI: 10.1109/ICESC57686.2023.10193559 (https://doi.org/10.1109/ICESC57686.2023.10193559)
The growing number of cyber-attacks and fraudulent activities on the internet makes it necessary to identify malicious URLs in real time. In this study, we propose a method that uses machine learning to classify URLs into four categories: phishing, malware, benign, and defacement. The dataset used to train and test our models contains 651,191 URLs with a variety of features, such as the length of the URL, the presence or absence of symbols, the length of the hostname, the length of the path, and many more. To find the machine learning algorithm and architecture that produces the best results for the classification task, we investigate a variety of options. Based on the results of our experiments, a multi-layer perceptron (MLP) architecture performs significantly better than other models, achieving an accuracy of 95.6 percent. To handle the large dataset, we implemented a parallel data processing pipeline. This pipeline preprocesses and extracts features from URLs in parallel, which significantly reduces the amount of time needed for training. Our proposed method offers a practical answer to the problem of identifying potentially harmful URLs and is adaptable enough to be incorporated into existing infrastructure in order to improve the safety of internet users.
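The lexical features and the parallel extraction step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the symbol set `SYMBOLS`, the feature names, and the worker and chunk-size defaults are all assumptions made for illustration, and only the Python standard library is used. The classifier itself is left out.

```python
import re
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from typing import Dict, List
from urllib.parse import urlsplit

# Hypothetical symbol set: the abstract only mentions "the presence or
# absence of symbols" without listing them.
SYMBOLS = "@?-=.#%+$!*,"

# Matches a leading scheme such as "http://" or "ftp://".
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def extract_features(url: str) -> Dict[str, int]:
    """Map one URL string to a flat dictionary of lexical features."""
    # urlsplit only recognises the hostname when the network location is
    # introduced by "//", so scheme-less inputs get that prefix first.
    candidate = url if _SCHEME.match(url) else "//" + url
    try:
        parts = urlsplit(candidate)
        hostname, path = parts.hostname or "", parts.path
    except ValueError:
        # Malformed inputs (for example a broken IPv6 literal) keep
        # zero-length hostname and path instead of aborting the batch.
        hostname, path = "", ""
    features = {
        "url_length": len(url),
        "hostname_length": len(hostname),
        "path_length": len(path),
    }
    for symbol in SYMBOLS:
        features[f"has_{symbol}"] = int(symbol in url)
    return features


def extract_features_parallel(
    urls: List[str],
    workers: int = 4,
    chunksize: int = 1000,
    use_processes: bool = True,
) -> List[Dict[str, int]]:
    """Extract features for many URLs, preserving input order."""
    # Processes give real parallelism for this CPU-bound work; threads are
    # offered only as a lightweight fallback for small inputs.
    executor_cls = ProcessPoolExecutor if use_processes else ThreadPoolExecutor
    with executor_cls(max_workers=workers) as pool:
        # chunksize batches URLs per task to limit inter-process overhead
        # (it has no effect with the thread-based executor).
        return list(pool.map(extract_features, urls, chunksize=chunksize))
```

When `use_processes=True`, the call should sit under an `if __name__ == "__main__":` guard on platforms that start worker processes by spawning. The resulting dictionaries can be converted into a numeric matrix and passed to whichever classifier is being evaluated, such as an MLP.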