Feature Selection and Classification Methods for Predicting Search Engine Ranking

Willy K. Portier, Yujian Li, B. A. Kouassi
Published in: Proceedings of the 2020 3rd International Conference on Signal Processing and Machine Learning
Publication date: 2020-10-22
DOI: 10.1145/3432291.3432309 (https://doi.org/10.1145/3432291.3432309)
Citations: 1

Abstract

Over the past two decades, machine learning methods have improved the accuracy of computer-aided tasks. Search engines (Google, Baidu, Bing, and others) use classification methods to rank the billions of pages available on the World Wide Web. Rankings are produced by algorithms that use various features to classify each page for a given search query. This paper analyzes the performance of various machine learning models applied to features selected through different techniques. A dataset of 28,000 observations with 31 features was evaluated using only the features with the highest correlation. To that end, three filter methods (Chi-square, Gini index, and Fisher) and three wrapper methods (Forward Selection, Backward Elimination, and Bidirectional Elimination) were used. Various classification algorithms were then tested in combination with these filter and wrapper methods. The combinations were compared to determine the optimal feature sets for predicting whether a URL appears in the Google Top 10 SERP. For this dataset, the Random Forest model combined with either the Fisher filter method or the Backward Elimination wrapper method produced the best results among the combinations tested.
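
To illustrate what a filter method such as the Fisher score does, the sketch below ranks features by the ratio of between-class to within-class variance. The abstract does not give the exact formula used in the paper, so this follows one common definition of the Fisher score. The function names (fisher_scores, top_k_features) and the toy data are hypothetical and are not taken from the paper or its dataset.

```python
from collections import defaultdict
from statistics import fmean, pvariance


def fisher_scores(X, y):
    """Return one Fisher score per feature column of X.

    score_j = sum_k n_k * (mean_jk - mean_j)^2 / sum_k n_k * var_jk,
    where k runs over the class labels found in y.
    """
    # Group the rows by class label.
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    scores = []
    for j in range(len(X[0])):
        overall_mean = fmean(row[j] for row in X)
        between = 0.0  # spread of the class means around the overall mean
        within = 0.0   # spread of the values inside each class
        for rows in rows_by_class.values():
            values = [row[j] for row in rows]
            between += len(values) * (fmean(values) - overall_mean) ** 2
            within += len(values) * pvariance(values)
        if within > 0:
            scores.append(between / within)
        elif between > 0:
            scores.append(float("inf"))  # classes perfectly separated
        else:
            scores.append(0.0)  # constant feature, carries no information
    return scores


def top_k_features(scores, k):
    """Return the indices of the k highest-scoring features."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


# Hypothetical toy data: feature 0 separates the two classes, feature 1 does not.
X = [[1.0, 5.0], [1.2, 1.0], [0.9, 4.0], [3.0, 2.0], [3.1, 5.0], [2.9, 1.0]]
y = [0, 0, 0, 1, 1, 1]
scores = fisher_scores(X, y)
selected = top_k_features(scores, 1)
```

In a pipeline like the one the abstract describes, the selected feature indices would then be used to restrict the input of a classifier such as Random Forest. A wrapper method differs in that it retrains the classifier on candidate feature subsets instead of scoring each feature independently.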