{"title":"通过关键字查询下载文本隐藏的web内容","authors":"A. Ntoulas, P. Zerfos, Junghoo Cho","doi":"10.1145/1065385.1065407","DOIUrl":null,"url":null,"abstract":"An ever-increasing amount of information on the Web today is available only through search interfaces: the users have to type in a set of keywords in a search form in order to access the pages from certain Web sites. These pages are often referred to as the hidden Web or the deep Web. Since there are no static links to the hidden Web pages, search engines cannot discover and index such pages and thus do not return them in the results. However, according to recent studies, the content provided by many hidden Web sites is often of very high quality and can be extremely valuable to many users. In this paper, we study how we can build an effective hidden Web crawler that can autonomously discover and download pages from the hidden Web. Since the only \"entry point\" to a hidden Web site is a query interface, the main challenge that a hidden Web crawler has to face is how to automatically generate meaningful queries to issue to the site. We provide a theoretical framework to investigate the query generation problem for the hidden Web and we propose effective policies for generating queries automatically. Our policies proceed iteratively, issuing a different query in every iteration. We experimentally evaluate the effectiveness of these policies on 4 real hidden Web sites and our results are very promising. 
For instance, in one experiment, one of our policies downloaded more than 90% of a hidden Web site (that contains 14 million documents) after issuing fewer than 100 queries","PeriodicalId":248721,"journal":{"name":"Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '05)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2005-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"236","resultStr":"{\"title\":\"Downloading textual hidden web content through keyword queries\",\"authors\":\"A. Ntoulas, P. Zerfos, Junghoo Cho\",\"doi\":\"10.1145/1065385.1065407\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"An ever-increasing amount of information on the Web today is available only through search interfaces: the users have to type in a set of keywords in a search form in order to access the pages from certain Web sites. These pages are often referred to as the hidden Web or the deep Web. Since there are no static links to the hidden Web pages, search engines cannot discover and index such pages and thus do not return them in the results. However, according to recent studies, the content provided by many hidden Web sites is often of very high quality and can be extremely valuable to many users. In this paper, we study how we can build an effective hidden Web crawler that can autonomously discover and download pages from the hidden Web. Since the only \\\"entry point\\\" to a hidden Web site is a query interface, the main challenge that a hidden Web crawler has to face is how to automatically generate meaningful queries to issue to the site. We provide a theoretical framework to investigate the query generation problem for the hidden Web and we propose effective policies for generating queries automatically. Our policies proceed iteratively, issuing a different query in every iteration. 
We experimentally evaluate the effectiveness of these policies on 4 real hidden Web sites and our results are very promising. For instance, in one experiment, one of our policies downloaded more than 90% of a hidden Web site (that contains 14 million documents) after issuing fewer than 100 queries\",\"PeriodicalId\":248721,\"journal\":{\"name\":\"Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '05)\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2005-06-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"236\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '05)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/1065385.1065407\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '05)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1065385.1065407","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Downloading textual hidden web content through keyword queries
An ever-increasing amount of information on the Web today is available only through search interfaces: users must type a set of keywords into a search form to access the pages of certain Web sites. These pages are often referred to as the hidden Web or the deep Web. Since there are no static links to hidden Web pages, search engines cannot discover and index such pages and thus do not return them in their results. However, according to recent studies, the content provided by many hidden Web sites is often of very high quality and can be extremely valuable to many users. In this paper, we study how to build an effective hidden Web crawler that can autonomously discover and download pages from the hidden Web. Since the only "entry point" to a hidden Web site is a query interface, the main challenge a hidden Web crawler faces is how to automatically generate meaningful queries to issue to the site. We provide a theoretical framework for investigating the query generation problem for the hidden Web, and we propose effective policies for generating queries automatically. Our policies proceed iteratively, issuing a different query in every iteration. We experimentally evaluate the effectiveness of these policies on four real hidden Web sites, and our results are very promising. For instance, in one experiment, one of our policies downloaded more than 90% of a hidden Web site (containing 14 million documents) after issuing fewer than 100 queries.
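The iterative query-generation loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact algorithm: the `search` callback standing in for the site's query interface and the frequency-based choice of the next keyword (pick the most frequent term in the documents downloaded so far that has not yet been issued) are assumptions made here for concreteness.

```python
from collections import Counter

def crawl_hidden_site(search, seed_query, max_queries=100):
    """Iteratively issue keyword queries against a hidden-Web site.

    `search(query)` is assumed to return an iterable of (doc_id, text)
    pairs matching the query. Each iteration issues a new query, chosen
    as the most frequent not-yet-issued term in the pages downloaded so
    far, approximating an adaptive query-selection policy.
    """
    downloaded = {}   # doc_id -> text of every page retrieved so far
    issued = set()    # keywords already sent to the site
    query = seed_query
    for _ in range(max_queries):
        issued.add(query)
        for doc_id, text in search(query):
            downloaded.setdefault(doc_id, text)
        # Count term frequencies over everything downloaded so far.
        counts = Counter(
            word
            for text in downloaded.values()
            for word in text.lower().split()
        )
        candidates = [w for w, _ in counts.most_common() if w not in issued]
        if not candidates:
            break  # no new keyword to try; coverage has plateaued
        query = candidates[0]
    return downloaded
```

The key design point is the feedback loop: each batch of downloaded pages supplies the vocabulary from which the next query is drawn, so the crawler adapts to the site's actual content rather than relying on a fixed external dictionary.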