SMOTE Implementation on Phishing Data to Enhance Cybersecurity
Authors: M. Ahsan, Rahul Gomes, A. Denton
Published in: 2018 IEEE International Conference on Electro/Information Technology (EIT), 2018-05-03
DOI: https://doi.org/10.1109/EIT.2018.8500086
Phishing is a cybersecurity threat in which a criminal tries to gain access to users' personal information by infecting their systems with malware and viruses. Because phishing attempts appear to come from legitimate sources, it is very easy to fall for a phishing scam. Because each phishing instance has features that consistently differ from those of another, a system built on a predefined set of rules is of little use. Data mining techniques can be applied to collected network traffic to build models that predict future attacks. However, since most of the data packets are legitimate, a model trained on this imbalanced dataset tends to be biased toward positive results. In this study, we investigate how prediction accuracy on a balanced dataset compares with that on an imbalanced one. The Synthetic Minority Over-sampling Technique (SMOTE) is applied to balance the dataset, and XGBoost, Random Forest, and Support Vector Machines are applied to the phishing dataset. Results show much higher accuracy with SMOTE. The largest increase in accuracy was recorded for XGBoost, from 89.87% to 97.17%, showing that SMOTE is an effective tool in phishing data monitoring.
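
The balancing step described in the abstract can be illustrated with a minimal, standard-library sketch of the core SMOTE idea: each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its k nearest minority neighbours. This is not the authors' implementation; the abstract does not say which library or parameters were used. The function names `smote` and `balance`, the default `k=5`, the label encoding (1 = phishing, 0 = legitimate), and the toy two-feature data are all assumptions made for illustration.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Hypothetical helper: a simplified sketch of SMOTE, not a library API.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(max(0, n_new)):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        # Indices of the k nearest minority neighbours, excluding the base itself.
        neighbour_idxs = sorted(
            (i for i in range(len(minority)) if i != base_idx),
            key=lambda i: math.dist(base, minority[i]),
        )[:k]
        neighbour = minority[rng.choice(neighbour_idxs)]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def balance(X, y, minority_label, k=5, seed=0):
    """Oversample the minority class until both classes have equal counts."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_majority = len(y) - len(minority)
    new_points = smote(minority, n_majority - len(minority), k=k, seed=seed)
    return list(X) + new_points, list(y) + [minority_label] * len(new_points)


# Toy data (assumed): 8 legitimate samples (label 0), 3 phishing samples (label 1).
X = [
    (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.2, 0.4),
    (0.4, 0.2), (0.1, 0.5), (0.5, 0.1), (0.3, 0.5),
    (2.0, 2.1), (2.2, 1.9), (1.8, 2.3),
]
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

X_bal, y_bal = balance(X, y, minority_label=1)
print(y_bal.count(0), y_bal.count(1))  # prints "8 8"
```

A classifier such as those named in the abstract would then be trained on `X_bal` and `y_bal`. Resampling is normally applied to the training split only, so that no synthetic points end up in the evaluation data.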