FICA: A novel intelligent crawling algorithm based on reinforcement learning

Ali Mohammad Zareh Bidoki, N. Yazdani, Pedram Ghodsnia
Web Intell. Agent Syst., vol. 7, no. 1, published 2009-12-01. DOI: 10.3233/WIA-2009-0174 (https://doi.org/10.3233/WIA-2009-0174)
Citations: 8

Abstract

The web is a huge, highly dynamic environment that is growing exponentially in content and evolving rapidly in structure. No search engine can cover the whole web, so each must focus its crawling on the most valuable pages. An efficient crawling algorithm for retrieving the most important pages therefore remains a challenging problem. Several algorithms, such as PageRank and OPIC, have been proposed, but they have high time complexity and low throughput. In this paper we propose FICA, an intelligent crawling algorithm based on reinforcement learning that models a random surfing user. Crawl priority is based on a concept we call logarithmic distance. FICA is easy to implement, and its time complexity is O(E*logV), where V and E are the numbers of nodes and edges in the web graph, respectively. A comparison with other proposed algorithms shows that FICA outperforms them in discovering highly important pages. Furthermore, FICA computes the importance (ranking) of each page during the crawling process, so it can also be used as a ranking method for page importance. A nice property of FICA is its adaptability: it adjusts dynamically to changes in the web graph. We evaluated our approach on the UK web graph.
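
The abstract does not define logarithmic distance or the reinforcement-learning update, so the following is only a minimal sketch of how a distance-prioritized crawl frontier could reach an O(E*logV) bound. It assumes, purely for illustration, that following a link from a page costs log10 of that page's out-degree, and it uses a plain Dijkstra-style best-first traversal with a binary heap rather than FICA's actual learning rule. The function name `crawl_order` and the toy graph format are hypothetical.

```python
import heapq
import math


def crawl_order(graph, seed):
    """Return (visit order, distances) for a best-first crawl from `seed`.

    graph: dict mapping a page to the list of pages it links to
           (hypothetical toy input, not the paper's data format).
    Assumption: the cost of following any link out of page u is
    log10(out-degree of u); this is an illustrative choice only.
    """
    dist = {seed: 0.0}
    frontier = [(0.0, seed)]  # min-heap keyed by distance from the seed
    seen = set()
    order = []
    while frontier:
        d, page = heapq.heappop(frontier)
        if page in seen:
            continue  # stale heap entry left over from an earlier push
        seen.add(page)
        order.append(page)
        links = graph.get(page, [])
        if not links:
            continue
        cost = math.log10(len(links))
        for target in links:
            candidate = d + cost
            if target not in seen and candidate < dist.get(target, math.inf):
                dist[target] = candidate
                heapq.heappush(frontier, (candidate, target))
    return order, dist


if __name__ == "__main__":
    toy = {"seed": ["hub", "leaf"], "hub": ["x"], "leaf": ["x", "y", "z"]}
    order, dist = crawl_order(toy, "seed")
    print(order)
    print(dist)
```

Each page is crawled once and each heap operation costs O(log E), which is O(log V) since E is at most V squared, so the traversal as a whole runs in O(E*logV), the same bound the abstract states for FICA.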