Rough Set Based Ensemble Prediction for Topic Specific Web Crawling

2009 Seventh International Conference on Advances in Pattern Recognition. Pub Date: 2009-02-04. DOI: 10.1109/ICAPR.2009.17

S. Saha, C. A. Murthy, S. Pal


Citations: 1

Abstract


Rough Set Based Ensemble Prediction for Topic Specific Web Crawling

The rapid growth of the World Wide Web has made useful resource discovery an important problem in recent years. Several techniques, such as focused crawling and intelligent crawling, have recently been proposed for topic-specific resource discovery. All of these crawlers exploit hypertext features to perform topic-specific resource discovery. A focused crawler uses the relevance score of a crawled page to score the unvisited URLs extracted from it. The scored URLs are added to the frontier, and the crawler then picks the best-scoring URL to crawl next. Focused crawlers rely on different types of features of the crawled pages to keep the crawling scope within the desired domain; these features are obtained from the URL, the anchor text, the link structure, and the text contents of the parent and ancestor pages. Different focused crawling algorithms use different sets of these features to predict the relevance and quality of unvisited Web pages. This article proposes a combined method based on rough set theory. It combines the available predictions using decision rules and can build much larger domain-specific collections with less noise. In our experiments, the method achieved a better harvest rate and better target recall for focused crawling.
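The frontier loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' rough set method; it only shows the baseline behaviour in which unvisited URLs inherit the relevance score of the page they were extracted from and the best-scoring URL is crawled next. The names `focused_crawl`, `fetch_links`, `relevance`, `TOY_WEB`, and `TOY_RELEVANCE` are hypothetical, the toy link graph stands in for real HTTP fetching, and the two metrics follow commonly used definitions (fraction of crawled pages that are relevant, and fraction of a known target set that was reached), which the abstract itself does not spell out.

```python
import heapq
from typing import Callable, List, Set


def focused_crawl(
    seeds: List[str],
    fetch_links: Callable[[str], List[str]],
    relevance: Callable[[str], float],
    max_pages: int,
) -> List[str]:
    """Best-first crawl: each unvisited URL is scored with its parent's relevance."""
    # heapq is a min-heap, so scores are negated to pop the highest score first.
    # Assumption: seed URLs start with the maximum score of 1.0.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen: Set[str] = set(seeds)
    crawled: List[str] = []
    while frontier and len(crawled) < max_pages:
        _, url = heapq.heappop(frontier)
        crawled.append(url)
        parent_score = relevance(url)  # relevance of the page just crawled
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-parent_score, link))
    return crawled


def harvest_rate(crawled: List[str], relevant: Set[str]) -> float:
    """Fraction of crawled pages that are relevant (assumed definition)."""
    if not crawled:
        return 0.0
    return sum(1 for url in crawled if url in relevant) / len(crawled)


def target_recall(crawled: List[str], targets: Set[str]) -> float:
    """Fraction of known target pages that the crawl reached (assumed definition)."""
    if not targets:
        return 0.0
    return len(set(crawled) & targets) / len(targets)


# Hypothetical toy web: page -> outgoing links, and page -> relevance score.
TOY_WEB = {"seed": ["a", "b"], "a": ["c", "d"], "b": ["e"], "c": [], "d": [], "e": []}
TOY_RELEVANCE = {"seed": 1.0, "a": 0.9, "b": 0.1, "c": 0.8, "d": 0.7, "e": 0.2}

if __name__ == "__main__":
    order = focused_crawl(["seed"], TOY_WEB.__getitem__, TOY_RELEVANCE.__getitem__, 5)
    print(order)
    print(harvest_rate(order, {"seed", "a", "c", "d"}))
    print(target_recall(order, {"c", "d", "e"}))
```

In this toy run the low-relevance page `b` is still crawled early because it inherits the seed's score, which illustrates why a single parent-based prediction can be noisy and why combining several predictions, as the article proposes, may help.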
