阿拉伯网站的存档情况如何?

Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries Pub Date : 2015-06-21 DOI:10.1145/2756406.2756912

Lulwah M. Alkwai, Michael L. Nelson, Michele C. Weigle

{"title":"阿拉伯网站的存档情况如何?","authors":"Lulwah M. Alkwai, Michael L. Nelson, Michele C. Weigle","doi":"10.1145/2756406.2756912","DOIUrl":null,"url":null,"abstract":"It is has long been anecdotally known that web archives and search engines favor Western and English-language sites. In this paper we quantitatively explore how well indexed and archived are Arabic language web sites. We began by sampling 15,092 unique URIs from three different website directories: DMOZ (multi-lingual), Raddadi and Star28 (both primarily Arabic language). Using language identification tools we eliminated pages not in the Arabic language (e.g., English language versions of Al-Jazeera sites) and culled the collection to 7,976 definitely Arabic language web pages. We then used these 7,976 pages and crawled the live web and web archives to produce a collection of 300,646 Arabic language pages. We discovered: 1) 46% are not archived and 31% are not indexed by Google (www.google.com), 2) only 14.84% of the URIs had an Arabic country code top-level domain (e.g., .sa) and only 10.53% had a GeoIP in an Arabic country, 3) having either only an Arabic GeoIP or only an Arabic top-level domain appears to negatively impact archiving, 4) most of the archived pages are near the top level of the site and deeper links into the site are not well-archived, 5) the presence in a directory positively impacts indexing and presence in the DMOZ directory, specifically, positively impacts archiving.","PeriodicalId":256118,"journal":{"name":"Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries","volume":"2013 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":"{\"title\":\"How Well Are Arabic Websites Archived?\",\"authors\":\"Lulwah M. Alkwai, Michael L. Nelson, Michele C. Weigle\",\"doi\":\"10.1145/2756406.2756912\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"It is has long been anecdotally known that web archives and search engines favor Western and English-language sites. In this paper we quantitatively explore how well indexed and archived are Arabic language web sites. We began by sampling 15,092 unique URIs from three different website directories: DMOZ (multi-lingual), Raddadi and Star28 (both primarily Arabic language). Using language identification tools we eliminated pages not in the Arabic language (e.g., English language versions of Al-Jazeera sites) and culled the collection to 7,976 definitely Arabic language web pages. We then used these 7,976 pages and crawled the live web and web archives to produce a collection of 300,646 Arabic language pages. We discovered: 1) 46% are not archived and 31% are not indexed by Google (www.google.com), 2) only 14.84% of the URIs had an Arabic country code top-level domain (e.g., .sa) and only 10.53% had a GeoIP in an Arabic country, 3) having either only an Arabic GeoIP or only an Arabic top-level domain appears to negatively impact archiving, 4) most of the archived pages are near the top level of the site and deeper links into the site are not well-archived, 5) the presence in a directory positively impacts indexing and presence in the DMOZ directory, specifically, positively impacts archiving.\",\"PeriodicalId\":256118,\"journal\":{\"name\":\"Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries\",\"volume\":\"2013 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-06-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"13\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2756406.2756912\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2756406.2756912","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 13

摘要

人们早就知道，网络档案和搜索引擎偏爱西方和英语网站。在本文中，我们定量地探讨如何很好地索引和存档是阿拉伯语网站。我们首先从三个不同的网站目录中抽样15092个唯一的uri: DMOZ(多语言)、Raddadi和Star28(主要都是阿拉伯语)。使用语言识别工具，我们删除了非阿拉伯语的网页(例如，半岛电视台网站的英语版本)，并剔除了7976个绝对是阿拉伯语的网页。然后，我们使用这7,976个页面，并抓取实时网络和网络档案，生成一个包含300,646个阿拉伯语页面的集合。我们发现:1) 46%没有被归档和31%不是由谷歌索引(www.google.com), 2)只有14.84%的uri有阿拉伯国家代码顶级域名(如.sa),仅10.53%的机构有GeoIP在一个阿拉伯国家,3)有只有一个阿拉伯语GeoIP或只有一个阿拉伯语顶级域名似乎产生负面影响存档,4)附近的存档页面最顶级的站点和深层链接到站点不是well-archived,5)在目录中的存在对索引和DMOZ目录中的存在有积极的影响，特别是对归档有积极的影响。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

How Well Are Arabic Websites Archived?

It is has long been anecdotally known that web archives and search engines favor Western and English-language sites. In this paper we quantitatively explore how well indexed and archived are Arabic language web sites. We began by sampling 15,092 unique URIs from three different website directories: DMOZ (multi-lingual), Raddadi and Star28 (both primarily Arabic language). Using language identification tools we eliminated pages not in the Arabic language (e.g., English language versions of Al-Jazeera sites) and culled the collection to 7,976 definitely Arabic language web pages. We then used these 7,976 pages and crawled the live web and web archives to produce a collection of 300,646 Arabic language pages. We discovered: 1) 46% are not archived and 31% are not indexed by Google (www.google.com), 2) only 14.84% of the URIs had an Arabic country code top-level domain (e.g., .sa) and only 10.53% had a GeoIP in an Arabic country, 3) having either only an Arabic GeoIP or only an Arabic top-level domain appears to negatively impact archiving, 4) most of the archived pages are near the top level of the site and deeper links into the site are not well-archived, 5) the presence in a directory positively impacts indexing and presence in the DMOZ directory, specifically, positively impacts archiving.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries

自引率

0.00%

发文量