Feature Selection and Classification Methods for Predicting Search Engine Ranking

Proceedings of the 2020 3rd International Conference on Signal Processing and Machine Learning. Pub Date: 2020-10-22. DOI: 10.1145/3432291.3432309

Willy K. Portier, Yujian Li, B. A. Kouassi

Citations: 1

Abstract

Feature Selection and Classification Methods for Predicting Search Engine Ranking

Over the past two decades, machine learning methods have markedly improved the accuracy of computer-aided tasks. Search engines (Google, Baidu, Bing, ...) use classification methods to rank the billions of pages available on the World Wide Web. Rankings are produced by algorithms that use various features to classify each page for a given search query. The purpose of this paper is to analyze the performance of various machine learning models applied to features selected through different techniques. A dataset of 28,000 observations with 31 features was evaluated, considering only the features with the highest correlation. To that end, three filter methods (Chi-square, Gini index, and Fisher) and three wrapper methods (Forward Selection, Backward Elimination, and Bidirectional Elimination) were used. Various classification algorithms were then tested in combination with these filter and wrapper methods. A comparison was then made to determine the optimal feature combinations for predicting whether a URL appears in the Google Top 10 SERP. The study concludes that, for this dataset, the Random Forest model combined with either the Fisher filter method or the Backward Elimination wrapper method produced the best results among the combinations tested.
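To make the idea of a filter method concrete, the following is a minimal sketch of Fisher-score feature ranking in plain Python. It is not the authors' implementation: the abstract does not give a formula, so the score used here (between-class scatter divided by within-class scatter, computed per feature) is one common formulation and is an assumption, as are the function names `fisher_scores` and `top_k_features` and the toy data in the usage note.

```python
from collections import defaultdict
from statistics import fmean, pvariance


def fisher_scores(X, y):
    """Return one Fisher score per feature column (assumed formulation).

    X: list of rows, each a list of floats; y: list of class labels.
    Score = between-class scatter / within-class scatter for that feature.
    """
    n_features = len(X[0])
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    scores = []
    for j in range(n_features):
        overall_mean = fmean([row[j] for row in X])
        between = 0.0
        within = 0.0
        for rows in rows_by_class.values():
            column = [row[j] for row in rows]
            class_mean = fmean(column)
            # Class size times squared distance of the class mean from the overall mean.
            between += len(column) * (class_mean - overall_mean) ** 2
            # Class size times the population variance inside the class.
            within += len(column) * pvariance(column, mu=class_mean)
        if within == 0.0:
            # Feature is constant inside every class: infinite score if the
            # classes differ, zero if the feature is constant overall.
            scores.append(float("inf") if between > 0.0 else 0.0)
        else:
            scores.append(between / within)
    return scores


def top_k_features(X, y, k):
    """Indices of the k highest-scoring features, best first."""
    scores = fisher_scores(X, y)
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
```

As a usage example on made-up data, with `X = [[1.0, 5.0], [2.0, 3.0], [9.0, 4.0], [10.0, 6.0]]` and `y = [0, 0, 1, 1]`, the call `top_k_features(X, y, 1)` returns `[0]`, because the first column separates the two classes far better than the second. In a pipeline like the one the abstract describes, the retained columns would then be passed to a classifier such as a Random Forest; a wrapper method differs in that it repeatedly retrains the classifier itself to decide which features to add or drop.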
