Comparison of Gradient Boosting and Extreme Boosting Ensemble Methods for Webpage Classification
J. Dutta, Yong Woon Kim, Dalia Dominic
2020 Fifth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), 2020-11-26
DOI: 10.1109/ICRCICN50933.2020.9296176 (https://doi.org/10.1109/ICRCICN50933.2020.9296176)
Citations: 6
Abstract
Web page classification is an important task in areas such as web content filtering, contextual advertising, and maintaining or expanding web directories. Machine learning methods have been found to classify web pages well, and ensemble models have been used to improve on the results obtained from single classifiers. This work uses the Gradient Boosting and Extreme Boosting ensemble models for binary classification. The dataset, which contains URLs of web pages, was collected manually. A comparison of the two boosting algorithms confirmed the improvement in accuracy and speed obtained through Extreme Boosting: it was found to be around ten times faster than Gradient Boosting and also more accurate. The effect of three preprocessing techniques (lemmatization, stop-word removal and regular expressions) was also examined; these techniques improve the accuracy of the results, but not significantly.
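As a rough illustration of two of the three preprocessing steps named in the abstract, the sketch below applies a regular expression and stop-word removal to a URL string. It is not the authors' pipeline: the function name `url_tokens`, the regular expression, and the stop-word list are all assumptions made for this example, since the abstract does not specify them. Lemmatization is left out because it normally relies on a dictionary-backed tool (for instance NLTK's `WordNetLemmatizer`) rather than a few lines of standard-library code.

```python
import re
from typing import List

# Hypothetical stop-word list chosen for illustration only;
# the abstract does not say which stop words were removed.
STOP_WORDS = {"www", "http", "https", "com", "html", "index", "the", "and", "of"}


def url_tokens(url: str) -> List[str]:
    """Split a URL into lowercase alphabetic tokens and drop stop words."""
    # Regular-expression step: keep runs of letters only, which discards
    # digits, punctuation and separators such as '/', '.', '-' and '_'.
    words = re.findall(r"[a-z]+", url.lower())
    # Stop-word removal step: filter out tokens that carry little topical signal.
    return [w for w in words if w not in STOP_WORDS]


if __name__ == "__main__":
    print(url_tokens("https://www.example.com/sports/football-news/index.html"))
    # prints ['example', 'sports', 'football', 'news']
```

Tokens produced this way could then be vectorized and passed to a boosting classifier. The abstract's finding that preprocessing helps only slightly suggests that a simple scheme like this one may already capture most of the usable signal, though that reading is an interpretation rather than a result stated by the authors.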