An Ontology-Based Topical Crawling Algorithm for Accessing Deep Web Content

K. Arya, B. R. Vadlamudi
DOI: 10.1109/ICCCT.2012.10 (https://doi.org/10.1109/ICCCT.2012.10)
Published in: 2012 Third International Conference on Computer and Communication Technology
Publication date: 2012-11-23
Citation count: 3

Abstract

Given the volume of the Web and how frequently its content changes, the coverage and quality of the pages retrieved by modern search engines are relatively limited, because these engines crawl only hypertext links and ignore search forms, which are the entry points to deep web content, where two-thirds of the information resides. This paper presents an algorithm that enables topical crawlers to access hidden web content by using a domain-based ontology to determine each form's relevance to the domain. The domain considered in this work is scientific research publications. Experimental results show that the proposed approach outperforms keyword-based crawlers in terms of both relevancy and completeness.
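The abstract does not spell out the algorithm, so the following is only a minimal sketch of what "using a domain ontology to determine a form's relevance" could look like, not the authors' method. The mini-ontology, the function names (`normalize`, `matched_concepts`, `form_relevance`, `is_relevant`), the scoring rule (fraction of ontology concepts matched by a form's field labels), and the 0.4 threshold are all assumptions made for illustration.

```python
import re

# Hypothetical mini-ontology for the scientific publications domain:
# each concept maps to labels a search form might use for it.
ONTOLOGY = {
    "author": {"author", "authors", "writer"},
    "title": {"title", "paper title"},
    "venue": {"journal", "conference", "proceedings", "venue"},
    "year": {"year", "publication year", "date"},
    "keyword": {"keyword", "keywords", "subject"},
}


def normalize(label):
    """Lowercase a field label and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()


def matched_concepts(form_labels, ontology=ONTOLOGY):
    """Return the ontology concepts that at least one form label maps to."""
    labels = {normalize(label) for label in form_labels}
    return {concept for concept, terms in ontology.items() if labels & terms}


def form_relevance(form_labels, ontology=ONTOLOGY):
    """Score a form as the fraction of ontology concepts its labels cover."""
    return len(matched_concepts(form_labels, ontology)) / len(ontology)


def is_relevant(form_labels, threshold=0.4, ontology=ONTOLOGY):
    """Assumed decision rule: treat the form as on-topic above a threshold."""
    return form_relevance(form_labels, ontology) >= threshold


if __name__ == "__main__":
    search_form = ["Author:", "Paper Title", "Year"]
    login_form = ["Username", "Password"]
    print(form_relevance(search_form), is_relevant(search_form))
    print(form_relevance(login_form), is_relevant(login_form))
```

In this sketch a publication search form covers three of the five concepts and would be queued for crawling, while a login form covers none and would be skipped. A keyword-based crawler would instead match raw strings, which is the baseline the abstract compares against.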