{"title":"The website census","authors":"A. Qadeer, Waqar Mahmood, A. Waheed","doi":"10.1109/ICITST.2009.5402623","DOIUrl":null,"url":null,"abstract":"The website census is an effort to enumerate all the websites on the World Wide Web (WWW) without using crawling. Crawling is a traditional way of website discovery. It is conceptually simple but the very size of the WWW makes the implementation complex and resource demanding. The enormous amount of bandwidth, a huge persistent storage pool, a sufficiently large cluster of machines for data processing and a complex set of software systems are just a few examples of the needed resources. In this work, we use exhaustive IP range probing to detect the presence of a web server on TCP port 80. Although this probing is exhaustive in nature, it is lightweight in terms of resource demands. This enumeration of websites has many applications. The most obvious is to use it as a seed to the conventional crawling. It can be refined to be used as a top level domain (TLD) specific seed for targeted crawling.","PeriodicalId":251169,"journal":{"name":"2009 International Conference for Internet Technology and Secured Transactions, (ICITST)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2009 International Conference for Internet Technology and Secured Transactions, (ICITST)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICITST.2009.5402623","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
The website census is an effort to enumerate all the websites on the World Wide Web (WWW) without crawling. Crawling is the traditional way of discovering websites. It is conceptually simple, but the sheer size of the WWW makes its implementation complex and resource-demanding: an enormous amount of bandwidth, a huge persistent storage pool, a sufficiently large cluster of machines for data processing, and a complex set of software systems are just a few of the resources needed. In this work, we use exhaustive IP range probing to detect the presence of a web server on TCP port 80. Although this probing is exhaustive in nature, it is lightweight in terms of resource demands. This enumeration of websites has many applications. The most obvious is to use it as a seed for conventional crawling. It can also be refined into a top-level domain (TLD) specific seed for targeted crawling.
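
As a rough illustration of what probing an IP range for TCP port 80 can look like, the following Python sketch walks the host addresses of a CIDR block and attempts a TCP connection to each one. It is not the authors' implementation, which the abstract does not describe. The function names (`hosts_in_range`, `has_open_port`, `census`), the one-second timeout, and the sequential probing order are all assumptions made for this example. An accepted connection only shows that something is listening on the port; confirming that it is a web server would need a follow-up HTTP request.

```python
import ipaddress
import socket
from typing import Iterable, Iterator, List


def hosts_in_range(cidr: str) -> Iterator[str]:
    """Yield every usable host address in a CIDR block, in ascending order."""
    for addr in ipaddress.ip_network(cidr, strict=False).hosts():
        yield str(addr)


def has_open_port(host: str, port: int = 80, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def census(addresses: Iterable[str], port: int = 80,
           timeout: float = 1.0) -> List[str]:
    """Return the addresses that accept a TCP connection on the given port."""
    return [addr for addr in addresses if has_open_port(addr, port, timeout)]


if __name__ == "__main__":
    # 192.0.2.0/30 is a documentation-only block (RFC 5737), used here as a
    # placeholder; substitute a range you are authorized to probe.
    print(census(hosts_in_range("192.0.2.0/30")))
```

Probing one address at a time keeps the sketch short. An exhaustive scan of a large range would more plausibly issue many probes concurrently, but the per-address check would stay the same.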