Authors: Nhamo Mtetwa, M. Yousefi, V. Reddy
Venue: 2017 IEEE 4th International Conference on Soft Computing & Machine Intelligence (ISCMI)
Published: 2017-11-01
DOI: 10.1109/ISCMI.2017.8279603 (https://doi.org/10.1109/ISCMI.2017.8279603)
Feature selection for an SVM based webpage classifier
Machine-learning techniques are a handy tool for deriving insights from data extracted from the web. Because of the structure of the data that web crawlers extract, the data must be preprocessed to obtain features that can be used to train a machine-learning classifier. The number of available features that can be linked to a website is huge, so narrowing them down to the minimum number required to drive a classifier has substantial benefits. This paper presents a workflow that uses a set of metrics to reduce the number of features for training a support vector machine (SVM) that classifies webpages as fraudulent or not. The paper reports that a three-quarter reduction in feature set size incurs only a 5% reduction in classification accuracy, a trade-off that brings large computational savings.
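The abstract does not name the metrics used in the workflow, so the following is only a minimal sketch of the general idea of metric-based (filter-style) feature reduction, not the authors' method. It assumes absolute Pearson correlation with the class label as a stand-in metric and keeps the top quarter of features, mirroring the three-quarter reduction mentioned above. The function names (`pearson_abs`, `select_top_fraction`, `reduce_rows`) and the toy data are hypothetical, and the SVM training step that would consume the reduced matrix is omitted.

```python
from math import sqrt


def pearson_abs(xs, ys):
    """Absolute Pearson correlation between one feature column and the labels."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column cannot separate the classes
    return abs(cov) / sqrt(var_x * var_y)


def select_top_fraction(rows, labels, keep_fraction=0.25):
    """Score every feature and return (kept column indices, all scores)."""
    n_features = len(rows[0])
    scores = [
        pearson_abs([row[j] for row in rows], labels)
        for j in range(n_features)
    ]
    n_keep = max(1, round(n_features * keep_fraction))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:n_keep]), scores


def reduce_rows(rows, kept):
    """Project each sample onto the kept feature columns."""
    return [[row[j] for j in kept] for row in rows]


# Hypothetical toy data (assumption): 6 webpages, 8 features each,
# label 1 = fraudulent, label 0 = legitimate.
rows = [
    [1, 0, 5, 3, 0, 2, 9, 1],
    [1, 1, 5, 4, 0, 2, 8, 0],
    [1, 0, 5, 3, 1, 2, 9, 1],
    [0, 1, 5, 4, 0, 2, 1, 0],
    [0, 0, 5, 3, 1, 2, 2, 1],
    [0, 1, 5, 4, 0, 2, 1, 0],
]
labels = [1, 1, 1, 0, 0, 0]

kept, scores = select_top_fraction(rows, labels, keep_fraction=0.25)
reduced = reduce_rows(rows, kept)
print(kept)
print(reduced)
```

In this sketch the reduced matrix has 2 of the original 8 columns; it would then be passed to whatever SVM trainer is in use. Whether accuracy degrades by only a few percent, as the paper reports for its own data, depends on the dataset and on the metric chosen.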