Using Web Pages Dynamicity to Prioritise Web Crawling

Proceedings of the 2019 2nd International Conference on Machine Learning and Machine Intelligence. Pub Date: 2019-09-18. DOI: 10.1145/3366750.3366757

Nisreen Alderratia, M. Elsheh


Citations: 2

Abstract

Web crawling is the process of collecting pages from the web so that they can be indexed and used to display search results that match users' requirements. Web crawlers must also continually revisit pages to keep the search engine database up to date. Because crawling is constrained by limited time and by network issues, it is essential to determine which pages are most important and should be recrawled first. This research therefore introduces a method that tells the crawler in what order to recrawl pages it has already crawled, so that more important and valuable pages are acquired earlier than others. The proposed crawling strategy combines topic similarity with the dynamicity of web pages: the crawler downloads relevant pages and recrawls them recursively, and every time a change is detected in a page, that page's counter is incremented. A page that is relevant and changes frequently is therefore considered important and is given high priority in the crawling process. The results indicate that web page dynamicity is an effective basis for prioritising pages during crawling: the most dynamic pages, whose content is the most likely to change, are obtained before the least dynamic ones.
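The prioritisation idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how changes are detected, how topic similarity is computed, or how the two signals are combined. The sketch assumes that a change is detected by comparing SHA-256 hashes of the fetched content, that each page already carries a topic-similarity score in [0, 1], and that pages below a hypothetical relevance threshold of 0.5 are dropped before the rest are ordered by change counter. The names `PageRecord`, `record_visit` and `recrawl_order` are all hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class PageRecord:
    url: str
    relevance: float       # topic similarity in [0, 1]; how it is computed is assumed
    change_count: int = 0  # incremented each time a recrawl finds changed content
    last_hash: str = ""    # fingerprint of the most recently fetched content


def record_visit(page: PageRecord, content: str) -> bool:
    """Store the latest fingerprint and bump the counter if the content changed."""
    new_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    # The first visit only establishes a baseline, so it is not counted as a change.
    changed = page.last_hash != "" and new_hash != page.last_hash
    if changed:
        page.change_count += 1
    page.last_hash = new_hash
    return changed


def recrawl_order(pages, min_relevance=0.5):
    """Return relevant pages, most frequently changed first.

    min_relevance is an assumed threshold, not a value from the paper.
    Ties are broken by relevance, then by URL for a stable order.
    """
    relevant = [p for p in pages if p.relevance >= min_relevance]
    return sorted(relevant, key=lambda p: (-p.change_count, -p.relevance, p.url))


# Usage: a frequently changing page is scheduled before a static one.
news = PageRecord("https://example.org/news", relevance=0.9)
about = PageRecord("https://example.org/about", relevance=0.8)
for snapshot in ["v1", "v2", "v3"]:
    record_visit(news, snapshot)
    record_visit(about, "static text")
print([p.url for p in recrawl_order([news, about])])
```

Sorting by the raw counter favours pages that have been visited more often. Dividing the counter by the number of visits would give a change rate instead, and which of the two the paper uses is not stated in the abstract.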

