An Ontology-Based Topical Crawling Algorithm for Accessing Deep Web Content

2012 Third International Conference on Computer and Communication Technology. Pub Date: 2012-11-23. DOI: 10.1109/ICCCT.2012.10

K. Arya, B. R. Vadlamudi

Citations: 3

Abstract

Given the volume of the Web and how frequently its content changes, the coverage and quality of the pages retrieved by modern search engines are relatively low. These engines crawl only hypertext links and ignore search forms, which are the entry points to deep web content, where two-thirds of the information resides. This paper presents an algorithm that enables topical crawlers to access hidden web content by using a domain-based ontology to determine each form's relevance to the domain. The domain considered in this work is scientific research publications. Experimental results show that the proposed approach outperforms keyword-based crawlers in both relevance and completeness.
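The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea of judging a search form's relevance against a domain ontology. Everything in it is an assumption made for illustration: the tiny `ONTOLOGY` dictionary, the `form_relevance` and `is_relevant` helpers, the concept-coverage score, and the 0.4 threshold are hypothetical and are not taken from the paper.

```python
import re
from html.parser import HTMLParser

# Hypothetical mini-ontology for the scientific-publications domain:
# concept -> single-word terms that may appear in form field names or labels.
ONTOLOGY = {
    "author": {"author", "writer", "creator"},
    "title": {"title"},
    "venue": {"journal", "conference", "proceedings", "venue"},
    "year": {"year", "date", "published"},
    "keyword": {"keyword", "keywords", "subject", "topic"},
}


class FormFieldParser(HTMLParser):
    """Collects field names, ids, placeholders, and label text from a form."""

    def __init__(self):
        super().__init__()
        self.descriptors = []
        self._in_label = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("input", "select", "textarea"):
            # Skip controls that carry no information about the domain.
            if attrs.get("type") in ("hidden", "submit", "button"):
                return
            for key in ("name", "id", "placeholder"):
                if attrs.get(key):
                    self.descriptors.append(attrs[key])
        elif tag == "label":
            self._in_label = True

    def handle_endtag(self, tag):
        if tag == "label":
            self._in_label = False

    def handle_data(self, data):
        if self._in_label and data.strip():
            self.descriptors.append(data.strip())


def form_relevance(form_html, ontology=ONTOLOGY):
    """Return the fraction of ontology concepts matched by the form's fields."""
    parser = FormFieldParser()
    parser.feed(form_html)
    tokens = set()
    for descriptor in parser.descriptors:
        # Split names such as "pub_year" or "author-name" into word tokens.
        tokens.update(re.split(r"[^a-z0-9]+", descriptor.lower()))
    matched = {concept for concept, terms in ontology.items() if tokens & terms}
    return len(matched) / len(ontology)


def is_relevant(form_html, threshold=0.4):
    """Assumed decision rule: relevant if enough concepts are covered."""
    return form_relevance(form_html) >= threshold


PUB_FORM = (
    '<form><label>Author</label><input type="text" name="author_name">'
    '<input type="text" name="title"><select name="pub_year"></select>'
    '<input type="submit" value="Search"></form>'
)
LOGIN_FORM = (
    '<form><input type="text" name="username">'
    '<input type="password" name="password"></form>'
)

if __name__ == "__main__":
    print(form_relevance(PUB_FORM), is_relevant(PUB_FORM))
    print(form_relevance(LOGIN_FORM), is_relevant(LOGIN_FORM))
```

In this sketch a publication search form that exposes author, title, and year fields covers three of the five assumed concepts, while a login form covers none. A keyword-based crawler would instead match raw strings, so scoring by concept coverage is one plausible way an ontology could make the relevance decision less sensitive to how a site names its fields.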

