Virtualized dynamic URL assignment web crawling model
Authors: Wani Rohit Bhaginath, S. Shingade, M. Shirole
Published in: 2014 International Conference on Advances in Engineering & Technology Research (ICAETR - 2014), 2014-08-01
DOI: 10.1109/ICAETR.2014.7012963 (https://doi.org/10.1109/ICAETR.2014.7012963)
Citations: 3
Abstract
Web search engines are software systems that retrieve information from the web: they accept a query as input and return results as files, pages, images, or other information. These search engines rely heavily on web crawlers, which interact with millions of web pages starting from a seed URL or a list of seed URLs. However, crawlers demand a large amount of computing resources, and the efficiency of a web search engine depends on the performance of its crawling processes. Despite continuous improvement in crawling processes, there is still a need for more efficient, lower-cost crawlers. Most existing crawlers have a centralized coordinator, which introduces a single point of failure. Taking these shortfalls into consideration, this paper proposes an architecture for a distributed web crawler. The architecture addresses two issues with existing web crawlers. The first is cost: the aim is a low-cost web crawler built on the virtualization concept of cloud computing. The second is balanced load distribution based on dynamic assignment of URLs. The first issue is addressed using multi-core machines, where each multi-core processor is divided into a number of virtual machines (VMs) that can perform different crawling tasks in parallel. The second issue is addressed using a clustering algorithm that assigns requests to machines according to the availability of the clusters, thereby balancing load among components according to their real-time condition. This paper discusses the distributed architecture and details of the implementation of the proposed algorithm.
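The abstract does not spell out the clustering algorithm, so the following is not the authors' method. It is only a minimal Python sketch of the general idea of dynamic URL assignment, in which each new URL goes to whichever worker currently has the least pending work. The `Dispatcher` class, its method names, and the worker labels such as `vm-1` are all hypothetical, and the sketch runs in a single process for illustration, whereas the proposed architecture is distributed.

```python
class Dispatcher:
    """Illustrative least-loaded URL assignment (hypothetical, not the paper's algorithm)."""

    def __init__(self, worker_ids):
        # Number of URLs each worker has been given but not yet finished.
        self.pending = {worker: 0 for worker in worker_ids}

    def assign(self, url):
        """Pick the worker with the fewest pending URLs; ties break by worker id."""
        worker = min(self.pending, key=lambda w: (self.pending[w], w))
        self.pending[worker] += 1
        return worker

    def complete(self, worker):
        """Record that a worker finished one URL, freeing capacity for new work."""
        if self.pending[worker] > 0:
            self.pending[worker] -= 1


if __name__ == "__main__":
    dispatcher = Dispatcher(["vm-1", "vm-2"])
    for url in ["http://example.com/a", "http://example.com/b"]:
        print(url, "->", dispatcher.assign(url))
    # vm-2 finishes first, so it receives the next URL.
    dispatcher.complete("vm-2")
    print("http://example.com/c", "->", dispatcher.assign("http://example.com/c"))
```

Because the choice is made at assignment time from current load rather than from a fixed URL-to-machine mapping, a worker that finishes early picks up new URLs sooner, which is the balancing effect the abstract describes in general terms.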