HiCrawl: A Hidden Web Crawler for Medical Domain

Sonali Gupta, K. Bhatia
Published in: 2013 International Symposium on Computational and Business Intelligence
Publication date: 2013-08-24
DOI: 10.1109/ISCBI.2013.39 (https://doi.org/10.1109/ISCBI.2013.39)
Citations: 6

Abstract

The Hidden Web refers to a huge portion of the WWW that holds numerous freely accessible Web databases hidden behind search form interfaces. These databases can be accessed only through dynamic web pages generated in response to user queries issued at the search form interface. Thus, the core challenge in implementing any crawler for the Hidden Web is to routinely get past these search form interfaces by automatically generating and issuing queries that help discover these dynamic Web pages. The paper provides a novel approach to guide the crawler in choosing the right query term to submit to any search form interface designed to accept keywords or terms as input. The system is based on the use of classification hierarchies, which may have been constructed either manually or automatically. For the purposes of illustration, we consider search form interfaces in the 'Medical' domain, one of the most popular domains among researchers, and use a manually generated top-down classification hierarchy in the same domain.
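The abstract does not spell out how terms are drawn from the hierarchy, so the following is only a minimal sketch of one way a top-down classification hierarchy could guide query-term selection. The hierarchy contents, the function names `select_query_terms` and `fake_search_form`, the result counts, and the rule of descending only below terms that return results are all assumptions made for illustration, not the authors' method.

```python
from collections import deque

# Hypothetical hand-made hierarchy (assumption): the abstract does not
# list the categories of the authors' medical hierarchy.
HIERARCHY = {
    "medicine": {
        "cardiology": {"arrhythmia": {}, "hypertension": {}},
        "oncology": {"leukemia": {}, "melanoma": {}},
    }
}


def select_query_terms(hierarchy, submit, max_queries=10):
    """Walk the hierarchy top-down (breadth-first), issuing one query per term.

    `submit` is any callable that takes a term and returns a result count.
    Children are explored only when their parent term returned results;
    this pruning rule is an assumption of the sketch.
    """
    productive = []
    queue = deque(hierarchy.items())
    issued = 0
    while queue and issued < max_queries:
        term, children = queue.popleft()
        issued += 1
        if submit(term) > 0:
            productive.append(term)
            queue.extend(children.items())
    return productive


def fake_search_form(term):
    """Stand-in for a real keyword search form; the counts are made up."""
    counts = {
        "medicine": 120,
        "cardiology": 40,
        "oncology": 0,
        "arrhythmia": 7,
        "hypertension": 12,
    }
    return counts.get(term, 0)


terms = select_query_terms(HIERARCHY, fake_search_form)
print(terms)
```

In this toy run, the subtree under "oncology" is never queried because the parent term returned no results. A real crawler would replace `fake_search_form` with code that fills in and submits the actual search form.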