Performance Optimization of Focused Web Crawling Using Content Block Segmentation

Bireshwar Ganguly, Devashri Raich
{"title":"Performance Optimization of Focused Web Crawling Using Content Block Segmentation","authors":"Bireshwar Ganguly, Devashri Raich","doi":"10.1109/ICESC.2014.69","DOIUrl":null,"url":null,"abstract":"The World Wide Web (WWW) is a collection of billions of documents formatted using HTML. Web Search engines are used to find the desired information on the World Wide Web. Whenever a user query is inputted, searching is performed through that database. The size of repository of search engine is not enough to accommodate every page available on the web. So it is desired that only the most relevant pages must be stored in the database. So, to store those most relevant pages from the World Wide Web, a better approach has to be followed. The software that traverses web for getting the relevant pages is called \"Crawlers\" or \"Spiders\". A specialized crawler called focused crawler traverses the web and selects the relevant pages to a defined topic rather than to explore all the regions of the web page. The crawler does not collect all the web pages, but retrieves only the relevant pages out of all. So the major problem is how to retrieve the relevant and quality web pages. To address this problem, in this paper, we have designed and implemented an algorithm which partitions the web pages on the basis of headings into content blocks and then calculates the relevancy of each partitioned block in web page. Then the page relevancy is calculated by sum of all block relevancy scores in one page. It also calculates the URL score and identifies whether the URL is relevant to a topic or not On the basis of headings, there is an appropriate division of pages into blocks because a complete block comprises of the heading, content, images, links, tables and sub tables of a particular block only.","PeriodicalId":335267,"journal":{"name":"2014 International Conference on Electronic Systems, Signal Processing and Computing Technologies","volume":"6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 International Conference on Electronic Systems, Signal Processing and Computing Technologies","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICESC.2014.69","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6

Abstract

The World Wide Web (WWW) is a collection of billions of documents formatted using HTML. Web search engines are used to find the desired information on the World Wide Web. Whenever a user query is entered, the search is performed over the search engine's database. However, a search engine's repository is not large enough to accommodate every page available on the web, so only the most relevant pages should be stored in the database. To store the most relevant pages from the World Wide Web, a better approach has to be followed. The software that traverses the web to fetch relevant pages is called a "crawler" or "spider". A specialized crawler, called a focused crawler, traverses the web and selects pages relevant to a defined topic rather than exploring all regions of the web. The crawler does not collect all web pages; it retrieves only the relevant ones. The major problem, therefore, is how to retrieve relevant, high-quality web pages. To address this problem, in this paper we design and implement an algorithm that partitions a web page into content blocks on the basis of headings and then calculates the relevancy of each partitioned block. The page relevancy is then calculated as the sum of all block relevancy scores on the page. The algorithm also calculates a URL score to identify whether a URL is relevant to the topic. Partitioning on the basis of headings yields an appropriate division of pages into blocks, because a complete block comprises only the heading, content, images, links, tables, and sub-tables of a particular block.
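The abstract only outlines the method (heading-based block segmentation, per-block relevancy, page relevancy as the sum of block scores, and a URL score); it does not give the exact scoring formula. The following is a minimal Python sketch of that pipeline under illustrative assumptions: heading boundaries are found with a simple regex, relevancy is plain keyword-frequency counting, and the TOPIC_KEYWORDS list and function names are hypothetical, not the authors' implementation.

```python
import re
from typing import List

# Hypothetical topic keyword list; the paper's actual topic
# representation and weighting are not given in the abstract.
TOPIC_KEYWORDS = ["crawler", "focused crawling", "web search"]

HEADING_RE = re.compile(r"<h[1-6][^>]*>", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")


def split_into_blocks(html: str) -> List[str]:
    """Partition a page into content blocks, one block per heading.

    Each block runs from one heading tag to the next, so it keeps the
    heading together with its content, links, images, and tables.
    """
    starts = [m.start() for m in HEADING_RE.finditer(html)]
    if not starts:
        return [html]  # no headings: treat the whole page as one block
    starts.append(len(html))
    return [html[starts[i]:starts[i + 1]] for i in range(len(starts) - 1)]


def block_relevancy(block: str, keywords: List[str]) -> float:
    """Score one block by keyword occurrences in its visible text."""
    text = TAG_RE.sub(" ", block).lower()
    return float(sum(text.count(kw.lower()) for kw in keywords))


def page_relevancy(html: str, keywords: List[str]) -> float:
    """Page relevancy = sum of the relevancy scores of all blocks."""
    return sum(block_relevancy(b, keywords) for b in split_into_blocks(html))


def url_score(url: str, keywords: List[str]) -> float:
    """Simple URL score: keyword hits in the URL string itself."""
    return float(sum(url.lower().count(kw.lower()) for kw in keywords))


if __name__ == "__main__":
    page = """
    <html><body>
      <h2>Focused crawling</h2><p>A focused crawler visits relevant pages.</p>
      <h2>Unrelated news</h2><p>Weather report for tomorrow.</p>
    </body></html>
    """
    print(page_relevancy(page, TOPIC_KEYWORDS))                         # -> 2.0
    print(url_score("http://example.com/focused-crawler", TOPIC_KEYWORDS))  # -> 1.0
```

In this sketch, a page with no headings degenerates to a single block, and the URL score is computed independently of the page content, which mirrors the abstract's separation of page relevancy and URL relevancy; the real system presumably uses a more refined relevance measure than raw keyword counts.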