Rough Set Based Ensemble Prediction for Topic Specific Web Crawling

S. Saha, C. A. Murthy, S. Pal
Published in: 2009 Seventh International Conference on Advances in Pattern Recognition
Publication date: 2009-02-04
DOI: 10.1109/ICAPR.2009.17 (https://doi.org/10.1109/ICAPR.2009.17)
Citations: 1

Abstract

The rapid growth of the World Wide Web has made useful resource discovery an important problem in recent years. Several techniques, such as focused crawling and intelligent crawling, have recently been proposed for topic-specific resource discovery. All of these crawlers use hypertext features to perform topic-specific resource discovery. A focused crawler uses the relevance score of a crawled page to score the unvisited URLs extracted from it. The scored URLs are added to the frontier, and the crawler then picks the best URL to crawl next. Focused crawlers rely on different types of features of the crawled pages to keep the crawl within the desired domain; these features are obtained from the URL, the anchor text, the link structure, and the text contents of the parent and ancestor pages. Different focused crawling algorithms use different sets of these features to predict the relevance and quality of unvisited Web pages. This article proposes a combined method based on Rough Set Theory. It combines the available predictions using decision rules and can build much larger domain-specific collections with less noise. In our experiments, the method achieved a better harvest rate and better target recall for focused crawling.
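The frontier mechanism described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy link graph `TOY_WEB`, its relevance scores, the function name `focused_crawl`, and the 0.5 relevance threshold are all hypothetical, and the rough-set rule combination itself is not modeled. Harvest rate is computed here in its usual sense, as the fraction of crawled pages that are relevant.

```python
import heapq

# Hypothetical toy web: page -> (relevance score in [0, 1], outgoing links).
TOY_WEB = {
    "seed": (0.9, ["a", "b"]),
    "a": (0.8, ["c"]),
    "b": (0.1, ["d"]),
    "c": (0.7, []),
    "d": (0.2, []),
}


def focused_crawl(web, seed, budget, threshold=0.5):
    """Best-first crawl: each unvisited URL inherits its parent's relevance."""
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so scores are negated
    seen = {seed}
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)  # pick the best-scored URL next
        relevance, links = web[url]
        visited.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                # Score the unvisited URL with the crawled page's relevance.
                heapq.heappush(frontier, (-relevance, link))
    relevant = [u for u in visited if web[u][0] >= threshold]
    harvest_rate = len(relevant) / len(visited) if visited else 0.0
    return visited, harvest_rate


if __name__ == "__main__":
    pages, rate = focused_crawl(TOY_WEB, "seed", budget=4)
    print(pages, rate)
```

In this toy graph the off-topic page `b` is crawled before the on-topic page `c`, because both `a` and `b` inherit the same score from their parent. A single feature can mislead the crawler in this way, which is the kind of noise that combining several predictions is meant to reduce.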