Improved focused crawling approach for retrieving relevant pages based on block partitioning

Debashis Hati, Amritesh Kumar
Published in: 2010 2nd International Conference on Education Technology and Computer
Publication date: 2010-06-22
DOI: 10.1109/ICETC.2010.5529547 (https://doi.org/10.1109/ICETC.2010.5529547)
Citations: 6

Abstract

Crawlers are software programs that traverse the internet and retrieve web pages by following hyperlinks. Given the large number of websites, traditional web crawlers cannot retrieve relevant pages effectively. To address this, focused crawlers use semantic web technologies to analyze the semantics of hyperlinks and web documents. A focused crawler is a special-purpose search engine that selectively seeks out pages relevant to a predefined set of topics, rather than exploring all regions of the web. Its main characteristic is that it does not need to collect all web pages; it selects and retrieves only the relevant ones. The major problem is therefore how to retrieve the maximal set of relevant, high-quality pages. To address this problem, we have designed a focused crawler that calculates the relevancy of each block in a web page. Blocks are partitioned using the VIPS algorithm. Page relevancy is calculated as the sum of all block relevancy scores in a page. The crawler also calculates a URL score to identify whether a URL is relevant to a specific topic.
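
The scoring idea in the abstract (page relevancy as the sum of block relevancy scores, plus a separate URL score) can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas, so everything below is an assumption made for illustration: the weighted term-frequency block score, the URL scoring rule, and the names `tokenize`, `block_relevance`, `page_relevance`, `url_score`, and `topic_weights` are all hypothetical. The blocks are assumed to have been produced already by a segmenter such as VIPS, which is not implemented here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def block_relevance(block_text, topic_weights):
    """Assumed block score: weighted topic-term count divided by block length."""
    tokens = tokenize(block_text)
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hits = sum(weight * counts[term] for term, weight in topic_weights.items())
    return hits / len(tokens)


def page_relevance(blocks, topic_weights):
    """Page relevancy as the sum of all block relevancy scores (per the abstract)."""
    return sum(block_relevance(block, topic_weights) for block in blocks)


def url_score(url, anchor_text, topic_weights):
    """Assumed URL score: share of total topic weight found in the URL or anchor text."""
    total = sum(topic_weights.values())
    if total == 0:
        return 0.0
    tokens = set(tokenize(url + " " + anchor_text))
    matched = sum(w for term, w in topic_weights.items() if term in tokens)
    return matched / total


# Hypothetical topic description and pre-segmented page blocks.
topic_weights = {"crawler": 1.0, "web": 0.5}
blocks = ["focused crawler web", "buy cheap shoes"]

print(page_relevance(blocks, topic_weights))
print(url_score("http://example.com/web-crawler", "read more", topic_weights))
```

In a crawler built along these lines, the URL score would typically be compared against a chosen threshold to decide whether a link is queued for fetching; the threshold value is a tuning choice that the abstract does not specify.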