Characterizing the UAE national Web with a two-step filter
M. Sanver, Chiraz BenAbdelkader
2011 IEEE GCC Conference and Exhibition (GCC), 2011-04-19
DOI: 10.1109/IEEEGCC.2011.5752614
The Web is a large collection of pages organized around a hierarchical domain system. Selecting a subset of it for searching, analysis, or other purposes is a challenging problem. In this paper, we address the problem of identifying the subset of pages related to a given country. A Web page ‘belongs’ to a national Web if it bears or represents identities from that country. Relying on the national domain (such as .ae or .uk) as the primary identifier, with IP address, geographic location, and language as secondary identifiers, is no longer adequate. We propose a two-step Web page classifier consisting of (1) a pre-crawl filter and (2) a post-crawl filter. The former prunes out Web pages that do not belong to the nation under investigation before they are fetched, while the latter filters out irrelevant pages through a multi-step analysis after they are downloaded. We use the United Arab Emirates (UAE) national Web as a case study. We share our experience crawling this national Web, introduce the crawler designed for the task, and present some of our results and findings.
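The two-step structure described above can be illustrated with a minimal Python sketch. The abstract does not list the signals each filter uses, so everything concrete here is an assumption made for illustration: the `.ae` suffix check, the `KNOWN_NATIONAL_HOSTS` allowlist, the `NATIONAL_TERMS` keyword list, the `min_hits` threshold, and the function names are all hypothetical. The post-crawl step is also reduced to a single keyword check, whereas the paper describes a multi-step analysis.

```python
from urllib.parse import urlparse

# Hypothetical signals, chosen only for illustration; the abstract does not
# specify the criteria used by either filter.
NATIONAL_TLD = ".ae"
KNOWN_NATIONAL_HOSTS = {"example-uae-news.com"}  # assumed allowlist of non-.ae hosts
NATIONAL_TERMS = ("united arab emirates", "abu dhabi", "dubai", "sharjah")


def pre_crawl_filter(url: str) -> bool:
    """Step 1: decide from the URL alone whether the page is worth fetching."""
    host = (urlparse(url).hostname or "").lower()
    return host.endswith(NATIONAL_TLD) or host in KNOWN_NATIONAL_HOSTS


def post_crawl_filter(text: str, min_hits: int = 2) -> bool:
    """Step 2: inspect downloaded content (simplified to one keyword check)."""
    lowered = text.lower()
    hits = sum(1 for term in NATIONAL_TERMS if term in lowered)
    return hits >= min_hits


def belongs_to_national_web(url: str, fetch) -> bool:
    """Run both filters; `fetch` is any callable mapping a URL to page text."""
    if not pre_crawl_filter(url):
        return False  # pruned before download, so no bandwidth is spent
    return post_crawl_filter(fetch(url))
```

The point of the ordering is that the cheap URL-level check runs first, so `fetch` is never called for pages that the pre-crawl filter rejects.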