Characterizing the UAE national Web with a two-step filter
M. Sanver, Chiraz BenAbdelkader
2011 IEEE GCC Conference and Exhibition (GCC), 2011-04-19
DOI: 10.1109/IEEEGCC.2011.5752614 (https://doi.org/10.1109/IEEEGCC.2011.5752614)
Abstract
The Web, as a large collection of pages, is organized around a hierarchical domain system. Selecting a subset of it for searching, analysis, or other purposes is a challenging problem. In this paper, we address the problem of identifying the subset of pages that relate to a given country. A Web page ‘belongs’ to a national Web if it bears or represents identities from a particular country. Using the national domain (such as .ae or .uk) as the primary identifier, with IP address, geographic location, and language as augmenting secondary identifiers, is no longer adequate. We propose a two-step Web page classifier consisting of (1) a pre-crawl filter and (2) a post-crawl filter. The former prunes out Web pages that do not belong to the nation under investigation before they are fetched, while the latter filters out irrelevant pages through a multiple-step analysis after download. We use the United Arab Emirates (UAE) national Web as a case study. We share our experience crawling the national Web, introduce the crawler designed to accomplish the task, and present some of our results and findings.
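The two-step structure described in the abstract can be illustrated with a minimal Python sketch. The abstract does not list the features used in either step, so every signal below (the `.ae` suffix check, the `KNOWN_NATIONAL_HOSTS` seed list, the `CONTENT_CUES` keywords, and the `min_cues` threshold) is a hypothetical stand-in chosen for illustration, not the authors' method. The only point it tries to show is the control flow: a cheap decision from the URL before any download, followed by a content-based decision after download.

```python
from urllib.parse import urlsplit

# Hypothetical signals, for illustration only; the abstract does not say
# which features the authors' filters actually use.
NATIONAL_TLD = ".ae"
KNOWN_NATIONAL_HOSTS = {"example-uae-portal.com"}  # hypothetical seed list
CONTENT_CUES = ("united arab emirates", "uae", "dubai", "abu dhabi")


def pre_crawl_filter(url: str) -> bool:
    """Step 1: decide from the URL alone whether the page is worth fetching."""
    host = (urlsplit(url).hostname or "").lower()
    return host.endswith(NATIONAL_TLD) or host in KNOWN_NATIONAL_HOSTS


def post_crawl_filter(text: str, min_cues: int = 2) -> bool:
    """Step 2: after download, keep the page only if enough cues appear."""
    lowered = text.lower()
    hits = sum(1 for cue in CONTENT_CUES if cue in lowered)
    return hits >= min_cues


def classify(url: str, fetch) -> bool:
    """Run both steps; `fetch` is any callable mapping a URL to page text."""
    if not pre_crawl_filter(url):
        return False  # pruned before any download happens
    return post_crawl_filter(fetch(url))
```

In this sketch a rejected URL never reaches `fetch`, which is where a pre-crawl stage saves bandwidth; the post-crawl stage here is a single keyword count, whereas the abstract describes a multiple-step analysis.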