Virtualized dynamic URL assignment web crawling model

2014 International Conference on Advances in Engineering & Technology Research (ICAETR - 2014). Pub Date: 2014-08-01. DOI: 10.1109/ICAETR.2014.7012963

Wani Rohit Bhaginath, S. Shingade, M. Shirole


Citation count: 3

Abstract

Web search engines are software systems that retrieve information from the web by accepting a query and returning results as files, pages, images, or other information. These search engines rely heavily on web crawlers, which start from a seed URL or a list of seed URLs and interact with millions of web pages. However, crawlers demand a large amount of computing resources, and the efficiency of a web search engine depends on the performance of its crawling processes. Despite continuous improvement in crawling, there is still a need for more efficient, lower-cost crawlers. Most existing crawlers have a centralized coordinator, which introduces a single point of failure. To address these shortfalls, this paper proposes an architecture for a distributed web crawler. The architecture addresses two issues with existing web crawlers. The first is cost: the crawler is made low-cost by applying the virtualization concept from cloud computing. The second is balanced load distribution, based on dynamic assignment of URLs. The first issue is addressed using multi-core machines, where each multi-core processor is divided into a number of virtual machines (VMs) that perform different crawling tasks in parallel. The second is addressed using a clustering algorithm that assigns requests to machines according to the availability of the clusters, thereby balancing load among components according to their real-time condition. This paper discusses the distributed architecture and the implementation details of the proposed algorithm.
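The abstract does not spell out the clustering algorithm, so the following is only a minimal sketch of the general idea of availability-based URL assignment, not the authors' method. It assumes that "availability" can be approximated by the number of URLs currently pending on each crawler VM; the class name `Dispatcher`, its methods, and the worker names are hypothetical and chosen for illustration.

```python
class Dispatcher:
    """Toy dynamic URL assigner: each URL goes to the least-loaded worker.

    Assumption: a worker's load is the count of URLs it has been given
    and has not yet reported as finished.
    """

    def __init__(self, workers):
        # Pending-URL count per worker (hypothetical crawler VMs).
        self.load = {w: 0 for w in workers}

    def assign(self, url):
        # Pick the worker with the smallest pending count;
        # ties are broken by worker name so the result is deterministic.
        worker = min(self.load, key=lambda w: (self.load[w], w))
        self.load[worker] += 1
        return worker

    def complete(self, worker):
        # Called when a worker finishes crawling one URL.
        if self.load[worker] > 0:
            self.load[worker] -= 1


dispatcher = Dispatcher(["vm-a", "vm-b"])
first = dispatcher.assign("http://example.com/1")
second = dispatcher.assign("http://example.com/2")
third = dispatcher.assign("http://example.com/3")
```

Because the choice is recomputed on every call, a worker that reports completed URLs through `complete` becomes preferred again, which is the sense in which the assignment follows real-time condition rather than a fixed partition of URLs.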


