Characterizing the UAE national Web with a two-step filter

2011 IEEE GCC Conference and Exhibition (GCC). Pub Date: 2011-04-19. DOI: 10.1109/IEEEGCC.2011.5752614

M. Sanver, Chiraz BenAbdelkader


Citations: 0

Abstract


Characterizing the UAE national Web with a two-step filter

The Web, a large collection of pages, is organized around a hierarchical domain system. Selecting a subset of it for searching, analysis, or other purposes is a challenging problem. In this paper, we address the problem of identifying the subset of pages related to a country. A Web page ‘belongs’ to a national Web if it bears or represents identities from a particular country. Using the national domain (such as .ae or .uk) as the primary identifier, with IP address, geographic location, and language as augmented/secondary identifiers, is no longer adequate. We propose a two-step Web page classifier consisting of (1) a pre-crawl filter and (2) a post-crawl filter. The former stage prunes Web pages that do not belong to the nation under investigation before they are fetched/downloaded, while the latter stage filters out irrelevant pages through a multiple-step analysis after download. We use the United Arab Emirates (UAE) national Web as a case study. We share our experience crawling the national Web, introduce the crawler designed to accomplish the task, and present some of our results and findings.
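The two-stage structure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' crawler: the abstract does not list the features either filter uses, so the hint lists, the `min_hits` threshold, and the function names `pre_crawl_filter`, `post_crawl_filter`, and `classify` are all assumptions chosen for illustration. The one point taken from the abstract is the ordering: a cheap URL-only decision is made before a page is fetched, and a content-based decision is made after download.

```python
from urllib.parse import urlparse

# Hypothetical signals; the paper's actual features are not given in the abstract.
NATIONAL_TLD = ".ae"
HOST_HINTS = ("uae", "dubai", "abudhabi", "sharjah")
CONTENT_HINTS = ("united arab emirates", "dubai", "abu dhabi", "+971")


def pre_crawl_filter(url: str) -> bool:
    """Stage 1: decide from the URL alone whether the page is worth fetching."""
    host = (urlparse(url).hostname or "").lower()
    if host.endswith(NATIONAL_TLD):
        return True
    # Hosts outside the national domain can still pass on a name hint.
    return any(hint in host for hint in HOST_HINTS)


def post_crawl_filter(text: str, min_hits: int = 2) -> bool:
    """Stage 2: decide from downloaded content whether the page is relevant."""
    lowered = text.lower()
    hits = sum(1 for hint in CONTENT_HINTS if hint in lowered)
    return hits >= min_hits


def classify(url: str, fetch) -> bool:
    """Run both stages; `fetch` is called only if the pre-crawl stage passes."""
    if not pre_crawl_filter(url):
        return False
    return post_crawl_filter(fetch(url))
```

Passing `fetch` as a parameter keeps the sketch free of network access: a caller could supply any function that maps a URL to page text. Pages rejected in the first stage never reach `fetch`, which is where a pre-crawl filter would save download cost.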
