FICA: A Fast Intelligent Crawling Algorithm
Authors: Shady Shehata, F. Karray, Mohamed S. Kamel
Published in: IEEE/WIC/ACM International Conference on Web Intelligence (WI'07), 2007-11-02
DOI: 10.1109/WI.2007.132 (https://doi.org/10.1109/WI.2007.132)
Citations: 21
Abstract
Because the Web is large and highly dynamic, designing an efficient crawling and ranking algorithm that retrieves the most important pages remains a challenging problem. Several algorithms, such as PageRank (Page et al., 1998) and OPIC (Abiteboul et al., 2003), have been proposed, but they have high time complexity. This paper proposes FICA, an intelligent crawling algorithm based on reinforcement learning that models a real surfing user. The crawling priority of a page is based on a concept we call logarithmic distance. FICA is easy to implement, and its time complexity is O(E log V), where V and E are the numbers of nodes and edges in the Web graph, respectively. A comparison with the other proposed algorithms shows that FICA outperforms them in discovering highly important pages. Furthermore, FICA computes the importance (ranking) of each page during the crawling process, so it can also be used as a ranking method for computing page importance. We used the UK Web graph in our experiments.
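The abstract does not define logarithmic distance or the reinforcement-learning update, so the following is only a minimal sketch of a priority-driven crawl order, not the authors' implementation. It assumes that following a link out of a page with out-degree d costs log(d), and that a page's priority is the smallest accumulated cost from the seed. The function name `crawl_by_log_distance`, the dictionary graph format, and the toy graph are all hypothetical. With a binary heap, each edge causes at most one push, which gives the O(E log V) bound quoted above.

```python
import heapq
import math


def crawl_by_log_distance(graph, seed):
    """Return (crawl_order, distance) for a toy link graph.

    graph: dict mapping a page to the list of pages it links to.
    Assumption (not stated in the abstract): each link out of a page
    with out-degree d has cost log(d).
    """
    distance = {seed: 0.0}
    visited = set()
    crawl_order = []
    frontier = [(0.0, seed)]  # min-heap keyed on accumulated distance

    while frontier:
        dist, page = heapq.heappop(frontier)
        if page in visited:
            continue  # stale heap entry; page was already crawled
        visited.add(page)
        crawl_order.append(page)

        links = graph.get(page, [])
        if not links:
            continue
        link_cost = math.log(len(links))  # assumed per-link cost
        for target in links:
            candidate = dist + link_cost
            if candidate < distance.get(target, math.inf):
                distance[target] = candidate
                heapq.heappush(frontier, (candidate, target))

    return crawl_order, distance


# Hypothetical toy graph: "t" is three clicks from the seed but is reached
# only through low-out-degree pages, so it is crawled before "w", "y", "z".
toy_graph = {
    "s": ["a", "b"],
    "a": ["x"],
    "b": ["y", "z", "w"],
    "x": ["t"],
}
order, dist = crawl_by_log_distance(toy_graph, "s")
```

In this sketch, pages with equal distance are popped in lexicographic order of their names, which is purely an artifact of how tuples compare in the heap.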