Virtualized dynamic URL assignment web crawling model

Wani Rohit Bhaginath, S. Shingade, M. Shirole
DOI: 10.1109/ICAETR.2014.7012963
Published in: 2014 International Conference on Advances in Engineering & Technology Research (ICAETR - 2014)
Publication date: 2014-08-01
Citations: 3

Abstract

Web search engines are software systems that retrieve information from the web by accepting a query as input and returning results as files, pages, images, or other information. These search engines rely heavily on web crawlers, which interact with millions of web pages given a seed URL or a list of seed URLs. However, such crawlers demand a large amount of computing resources, and the efficiency of a web search engine depends on the performance of its crawling processes. Despite continuous improvement in crawling techniques, there is still a need for more efficient, lower-cost crawlers. Most existing crawlers have a centralized coordinator, which introduces a single point of failure. Taking these shortfalls into consideration, this paper proposes an architecture for a distributed web crawler. The architecture addresses two issues with existing web crawlers. The first is to create a low-cost web crawler using the virtualization concept from cloud computing: each multi-core machine is divided into a number of virtual machines (VMs) that can perform different crawling tasks in parallel. The second is balanced load distribution based on dynamic assignment of URLs: a clustering algorithm assigns requests to machines according to cluster availability, thereby balancing load among components based on their real-time condition. This paper discusses the distributed architecture and the details of the implementation of the proposed algorithm.
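The dynamic URL assignment idea described above can be sketched minimally as follows. The paper's actual clustering algorithm is not reproduced here; the class name, VM identifiers, and load metric (count of outstanding URLs per VM) are illustrative assumptions. The sketch only shows the general principle of routing each URL to the machine with the lowest real-time load.

```python
from collections import defaultdict

class DynamicURLDispatcher:
    """Hypothetical sketch of availability-based URL assignment across
    crawler VMs. Each URL is routed to the VM with the fewest
    outstanding fetches, approximating balance by real-time condition."""

    def __init__(self, vm_ids):
        # outstanding-fetch count per VM (the assumed "availability" signal)
        self.load = {vm: 0 for vm in vm_ids}
        # per-VM queues of URLs currently assigned
        self.queues = defaultdict(list)

    def assign(self, url):
        # pick the VM with the smallest outstanding load
        vm = min(self.load, key=self.load.get)
        self.load[vm] += 1
        self.queues[vm].append(url)
        return vm

    def complete(self, vm, url):
        # called when a VM reports a finished fetch
        self.load[vm] -= 1
        self.queues[vm].remove(url)

dispatcher = DynamicURLDispatcher(["vm-0", "vm-1", "vm-2"])
for u in ["http://a.example", "http://b.example",
          "http://c.example", "http://d.example"]:
    dispatcher.assign(u)
# four URLs spread across three VMs: loads differ by at most one
```

A real implementation would replace the simple counter with the clustering-based availability measure the paper proposes, and would typically also keep all URLs from one host on the same VM for politeness.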