{"title":"一个两阶段的关键字为基础的爬虫收集深层网站","authors":"Ewit","doi":"10.30534/IJCCN/2018/29722018","DOIUrl":null,"url":null,"abstract":"Deep web is termed as sites present on web but not accessible by any search engine. Due to the large volume of web resources and the dynamic nature of deep web, achieving wide coverage and high efficiency is a challenging issue. Keyword based crawler for hidden web interfaces consist of mainly two stages, first is site locating another is in-site exploring. Site locating starts from seed sites and obtains relevant websites through reverse searching and obtains relevant sites through feature space of URL, anchor and text around URL. Second stage receives input from site locating and starts to find relevant link from those sites. The adaptive link learner is used to find out relevant links with help of link priority and link rank. To eliminate inclination on visiting some more closely related links in inaccessible web directories, we design a data structure called link tree to achieve broader coverage for a website.","PeriodicalId":313852,"journal":{"name":"International Journal of Computing, Communications and Networking","volume":"64 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A TWO-STAGE KEYWORD BASED CRAWLER FOR GATHERING DEEP-WEB SITES\",\"authors\":\"Ewit\",\"doi\":\"10.30534/IJCCN/2018/29722018\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Deep web is termed as sites present on web but not accessible by any search engine. Due to the large volume of web resources and the dynamic nature of deep web, achieving wide coverage and high efficiency is a challenging issue. Keyword based crawler for hidden web interfaces consist of mainly two stages, first is site locating another is in-site exploring. Site locating starts from seed sites and obtains relevant websites through reverse searching and obtains relevant sites through feature space of URL, anchor and text around URL. Second stage receives input from site locating and starts to find relevant link from those sites. The adaptive link learner is used to find out relevant links with help of link priority and link rank. 
To eliminate inclination on visiting some more closely related links in inaccessible web directories, we design a data structure called link tree to achieve broader coverage for a website.\",\"PeriodicalId\":313852,\"journal\":{\"name\":\"International Journal of Computing, Communications and Networking\",\"volume\":\"64 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-06-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Journal of Computing, Communications and Networking\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.30534/IJCCN/2018/29722018\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Computing, Communications and Networking","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.30534/IJCCN/2018/29722018","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
A TWO-STAGE KEYWORD BASED CRAWLER FOR GATHERING DEEP-WEB SITES
The deep web refers to sites that exist on the web but are not indexed by any search engine. Because of the large volume of web resources and the dynamic nature of the deep web, achieving wide coverage and high efficiency is a challenging issue. The keyword-based crawler for hidden web interfaces consists of two main stages: site locating and in-site exploring. Site locating starts from seed sites, obtains relevant websites through reverse searching, and ranks candidate sites using a feature space built from the URL, the anchor text, and the text surrounding the URL. The second stage takes the sites found by site locating as input and searches them for relevant links. An adaptive link learner identifies relevant links with the help of link priority and link rank. To eliminate the bias toward visiting only closely related links in otherwise inaccessible web directories, we design a data structure called a link tree, which achieves broader coverage of a website.
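The abstract outlines the two stages but gives no implementation details. The following is a minimal Python sketch of how such a pipeline could look; all names, the keyword-overlap scoring, and the first-path-segment grouping in LinkTree are illustrative assumptions, not the authors' actual method.

# Minimal sketch of a two-stage keyword-based crawler, assuming a simple
# keyword-overlap relevance score and a link tree keyed on the first path
# segment of each URL. Not the authors' implementation.
from collections import deque
from urllib.parse import urlparse

TOPIC_KEYWORDS = {"flight", "booking", "airfare"}  # example topic vocabulary


def score_text(text, keywords=TOPIC_KEYWORDS):
    """Crude relevance score: fraction of topic keywords present in the text."""
    tokens = set(text.lower().split())
    return len(tokens & keywords) / max(len(keywords), 1)


def score_site(url, anchor, context):
    """Stage 1: rank a candidate site by its URL, anchor text and surrounding text."""
    url_words = url.replace("/", " ").replace(".", " ").replace("-", " ")
    return (score_text(url_words) + score_text(anchor) + score_text(context)) / 3.0


class LinkTree:
    """Groups in-site links by their first path segment so the crawler spreads
    its visits across directories instead of drilling into a single branch."""

    def __init__(self):
        self.branches = {}

    def add(self, url, priority):
        branch = (urlparse(url).path.split("/") + [""])[1]  # first path segment
        self.branches.setdefault(branch, []).append((priority, url))

    def pop_balanced(self):
        """Yield the highest-priority link from each branch in turn."""
        for branch, links in list(self.branches.items()):
            links.sort(reverse=True)
            yield links.pop(0)[1]
            if not links:
                del self.branches[branch]


def in_site_explore(candidate_links, max_visits=10):
    """Stage 2: prioritise links inside one site using the link tree."""
    tree = LinkTree()
    for url, anchor in candidate_links:
        tree.add(url, priority=score_text(anchor))
    visited = []
    while len(visited) < max_visits and tree.branches:
        for url in tree.pop_balanced():
            visited.append(url)
            if len(visited) >= max_visits:
                break
    return visited


if __name__ == "__main__":
    # Toy stage-1 candidates: (url, anchor text, surrounding text)
    candidates = [
        ("http://example-flights.com", "cheap flight booking", "compare airfare deals"),
        ("http://example-news.com", "daily news", "politics and sports"),
    ]
    ranked = sorted(candidates, key=lambda c: score_site(*c), reverse=True)
    top_site = ranked[0][0]

    # Toy stage-2 links found on the top-ranked site
    links = [
        (top_site + "/booking/form", "book a flight"),
        (top_site + "/booking/deals", "airfare deals"),
        (top_site + "/about/team", "about us"),
    ]
    print(in_site_explore(links))

In this sketch, the link tree's per-branch round robin stands in for the paper's goal of avoiding bias toward one cluster of closely related links: each directory contributes its best link before any directory contributes a second one.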