HiCrawl: A Hidden Web Crawler for Medical Domain

2013 International Symposium on Computational and Business Intelligence. Pub Date: 2013-08-24. DOI: 10.1109/ISCBI.2013.39

Sonali Gupta, K. Bhatia


Citations: 6

Abstract


HiCrawl: A Hidden Web Crawler for Medical Domain

The Hidden Web is the large portion of the WWW that consists of freely accessible Web databases hidden behind search form interfaces. Their content can be reached only through dynamic web pages generated in response to user queries issued at the search form interface. Thus, the core challenge in implementing any crawler for the Hidden Web is to get past these search form interfaces routinely by automatically generating and issuing queries that help discover these dynamic Web pages. The paper provides a novel approach to guide the crawler in choosing the right query term to submit to any search form interface designed to accept keywords or terms as input. The system is based on classification hierarchies, which may have been constructed either manually or automatically. For the purposes of illustration, we consider search form interfaces in the 'Medical' domain, one of the domains most popular with researchers, and use a manually generated top-down classification hierarchy for the same domain.
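
The abstract does not spell out how a classification hierarchy drives query selection, so the following is only a minimal sketch of one plausible reading: walk the hierarchy top-down and submit broader terms before narrower ones. It is not the authors' algorithm. The hierarchy contents, the breadth-first order, the `submit` callable (standing in for whatever posts a term to a search form and returns result URLs), and the `max_queries` budget are all assumptions made for illustration.

```python
from collections import deque
from typing import Callable, Dict, Iterator, List, Set

# Hypothetical hand-built hierarchy: each term maps to its narrower terms.
HIERARCHY: Dict[str, List[str]] = {
    "medicine": ["cardiology", "oncology"],
    "cardiology": ["arrhythmia", "hypertension"],
    "oncology": ["leukemia"],
}


def query_terms(root: str, hierarchy: Dict[str, List[str]]) -> Iterator[str]:
    """Yield candidate query terms top-down (breadth-first), broad before narrow."""
    queue = deque([root])
    seen: Set[str] = set()
    while queue:
        term = queue.popleft()
        if term in seen:  # guard against terms listed under several parents
            continue
        seen.add(term)
        yield term
        queue.extend(hierarchy.get(term, []))


def crawl(
    root: str,
    hierarchy: Dict[str, List[str]],
    submit: Callable[[str], Set[str]],
    max_queries: int,
) -> Set[str]:
    """Submit at most max_queries terms and return the union of result URLs.

    `submit` is an assumed callable: it takes one query term and returns the
    set of result-page URLs that the search form produced for it.
    """
    found: Set[str] = set()
    for issued, term in enumerate(query_terms(root, hierarchy)):
        if issued >= max_queries:
            break
        found.update(submit(term))
    return found


def fake_submit(term: str) -> Set[str]:
    """Stand-in for a real form submission, so the sketch runs offline."""
    return {f"https://example.org/results/{term}"}


if __name__ == "__main__":
    print(list(query_terms("medicine", HIERARCHY)))
    print(sorted(crawl("medicine", HIERARCHY, fake_submit, max_queries=2)))
```

Keeping term generation separate from submission means the traversal order can be changed (for example to depth-first, or ranked by how many new pages each term returned) without touching the code that talks to the form.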

