Freshness of Web search engines: Improving performance of Web search engines using data mining techniques

S. Kharazmi, Ali Farahmand Nejad, H. Abolhassani
DOI: 10.1109/ICITST.2009.5402607
Published in: 2009 International Conference for Internet Technology and Secured Transactions (ICITST), 2009-11-01
URL: https://doi.org/10.1109/ICITST.2009.5402607
Citations: 8

Abstract

The growing use of Web-based information retrieval systems, such as general-purpose search engines, together with the dynamic nature of the Web, makes it necessary to maintain these systems continually. Crawlers facilitate this process by following hyperlinks in Web pages to automatically download new and updated pages. Freshness (recency) is one of the important maintenance factors for Web search engine crawlers, and refreshing a collection can take weeks to months. Many large Web crawlers start from seed pages, fetch every link from them, and repeat this process continually without any policy that would help them crawl better and improve their performance. We believe that data mining techniques can help improve the freshness parameter by extracting knowledge from crawling data. In this paper we propose a Web crawler that uses knowledge extracted by data mining techniques as crawling policies. For this purpose we include a component that collects additional crawling information. The crawler starts with non-preferential crawling. After a few crawls, it is trained by applying mining techniques to the crawling data and then uses the resulting policies for preferential crawling to improve freshness time. Our research shows that crawling with the determined policies yields better freshness than generic general-purpose Web crawlers.
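The two-phase idea in the abstract (crawl without preference, learn from the crawl data, then crawl preferentially) can be illustrated with a minimal sketch. The abstract does not name the mining technique or the data collected, so everything here is an assumption: the crawl log is taken to be a list of `(url, changed)` pairs, the "mining" step is reduced to a per-page change-frequency estimate, and the function names `learn_change_rates` and `preferential_order` are hypothetical.

```python
from collections import defaultdict


def learn_change_rates(crawl_log):
    """Estimate, per URL, the fraction of visits on which the page had changed.

    crawl_log: iterable of (url, changed) pairs gathered during the
    non-preferential phase; `changed` is a bool. (Assumed log format.)
    """
    visits = defaultdict(int)
    changes = defaultdict(int)
    for url, changed in crawl_log:
        visits[url] += 1
        if changed:
            changes[url] += 1
    return {url: changes[url] / visits[url] for url in visits}


def preferential_order(urls, change_rates, default_rate=0.5):
    """Order URLs so that pages most likely to have changed are visited first.

    URLs with no history get `default_rate` (an assumed neutral prior).
    sorted() is stable, so ties keep their input order.
    """
    return sorted(urls, key=lambda u: -change_rates.get(u, default_rate))


# Phase 1: log from a few non-preferential crawls (made-up illustrative data).
log = [("a", True), ("a", True), ("b", False), ("b", True),
       ("c", False), ("c", False)]
rates = learn_change_rates(log)

# Phase 2: revisit frequently changing pages first.
order = preferential_order(["c", "b", "a", "d"], rates)
```

A real policy would likely draw on more features than change frequency (the abstract mentions a component that collects additional crawling information), but the structure stays the same: learn from the crawl history, then use what was learned to rank revisits.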