Feature Selection and Classification Methods for Predicting Search Engine Ranking
Authors: Willy K. Portier, Yujian Li, B. A. Kouassi
Published in: Proceedings of the 2020 3rd International Conference on Signal Processing and Machine Learning, 2020-10-22
DOI: 10.1145/3432291.3432309 (https://doi.org/10.1145/3432291.3432309)
Over the past two decades, machine learning methods have improved the accuracy of many computer-aided tasks. Search engines (Google, Baidu, Bing...) use classification methods to rank the billions of pages available on the World Wide Web. Rankings are produced by algorithms that use various features to classify each page for a given search query. The purpose of this paper is to analyze the performance of various machine learning models applied to features selected through different techniques. A dataset of 28,000 observations with 31 features was evaluated, considering only the features with the highest correlation. To that end, three filter methods (Chi-square, Gini index and Fisher) and three wrapper methods (Forward Selection, Backward Elimination and Bidirectional Elimination) were used. Various classification algorithms were then tested in combination with these filter and wrapper methods. Finally, the combinations were compared to determine the optimal feature sets for predicting whether a URL appears in the Google top-10 SERP. The research concludes that, for this dataset, the Random Forest model combined with either the Fisher filter method or the Backward Elimination wrapper method gave the best results among the combinations tested.
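The abstract names the Fisher score as one of its filter methods but gives no formula or code. As a rough illustration only, the sketch below assumes the common definition of the Fisher score (between-class scatter of a feature divided by its within-class scatter) and ranks features on a small synthetic dataset. The toy data, the feature layout, and the function names `fisher_scores`, `top_k_features` and `make_toy_data` are hypothetical; they are not the paper's 31-feature dataset or the authors' implementation.

```python
import random
from statistics import fmean, pvariance


def fisher_scores(X, y):
    """Return one Fisher score per feature column of X.

    Assumed definition: sum_k n_k * (mean_k - mean)^2 divided by
    sum_k n_k * var_k, where k runs over the class labels in y.
    """
    classes = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        overall_mean = fmean(column)
        between = 0.0  # scatter of class means around the overall mean
        within = 0.0   # scatter of samples inside each class
        for c in classes:
            values = [v for v, label in zip(column, y) if label == c]
            between += len(values) * (fmean(values) - overall_mean) ** 2
            within += len(values) * pvariance(values)
        scores.append(between / within if within > 0 else 0.0)
    return scores


def top_k_features(scores, k):
    """Indices of the k highest-scoring features, best first."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]


def make_toy_data(n=400, seed=0):
    """Synthetic stand-in data: columns 1 and 3 depend on the label,
    columns 0 and 2 are pure noise. Label 1 plays the role of
    'URL is in the top 10' (an assumption for illustration)."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        X.append([
            rng.gauss(0.0, 1.0),           # noise
            rng.gauss(2.0 * label, 1.0),   # informative
            rng.gauss(0.0, 1.0),           # noise
            rng.gauss(-1.5 * label, 1.0),  # informative
        ])
        y.append(label)
    return X, y


X, y = make_toy_data()
scores = fisher_scores(X, y)
selected = top_k_features(scores, k=2)
print("Fisher scores:", [round(s, 3) for s in scores])
print("Selected feature indices:", selected)
```

In a pipeline like the one the abstract describes, the selected columns would then be passed to a classifier such as Random Forest. A wrapper method such as Backward Elimination would instead retrain the classifier on candidate feature subsets and discard the feature whose removal hurts validation performance least.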