FilteredWeb: A framework for the automated search-based discovery of blocked URLs

2017 Network Traffic Measurement and Analysis Conference (TMA) Pub Date : 2017-04-24 DOI:10.23919/TMA.2017.8002914

Alexander Darer, Oliver Farnan, Joss Wright

Citations: 11

Abstract

Various methods have been proposed for creating and maintaining lists of potentially filtered URLs to allow for measurement of ongoing internet censorship around the world. Whilst testing a known resource for evidence of filtering can be relatively simple, given appropriate vantage points, discovering previously unknown filtered web resources remains an open challenge. We present a novel framework for automating the discovery of filtered resources through adaptive queries to well-known search engines. Our system applies information retrieval algorithms to isolate characteristic linguistic patterns in known filtered web pages; these are used as the basis for web search queries. The URLs returned by these searches are checked for evidence of filtering, and newly discovered blocked resources are fed back into the system to detect further filtered content. Our implementation of this framework, applied to China as a case study, shows that the approach is effective at detecting large numbers of previously unknown filtered web pages, contributing to the ongoing detection of internet filtering as it develops. When deployed, this system discovered 1355 poisoned domains within China as of Feb 2017, 30 times more than in the most widely used published filter list of the time. Of these, 759 are outside of the Alexa Top 1000 domains list, demonstrating the capability of this framework to find more obscure filtered content. Our initial analysis of the filtered URLs, and of the search terms used to discover them, also gives insight into the nature of the content currently being blocked in China.
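
The feedback loop described in the abstract (extract characteristic terms from known filtered pages, search for them, test the results, and feed new blocked pages back in) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract only says "information retrieval algorithms", so TF-IDF is a stand-in, and `search`, `fetch` and `is_filtered` are hypothetical caller-supplied functions, not part of the authors' implementation. The ASCII-only tokenizer would also need replacing with language-aware segmentation for Chinese text.

```python
import math
import re
from collections import Counter, deque
from typing import Callable, Deque, Dict, Iterable, List, Set, Tuple


def tokenize(text: str) -> List[str]:
    """Split text into lowercase ASCII word tokens (illustrative only)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_terms(page: str, background: List[str], k: int = 3) -> List[str]:
    """Return up to k terms of `page` ranked by TF-IDF against `background`.

    TF-IDF is an assumed stand-in for the unspecified retrieval algorithm.
    """
    docs = [set(tokenize(d)) for d in background]
    tf = Counter(tokenize(page))
    n = len(docs) + 1  # background documents plus the page itself

    def score(term: str) -> float:
        df = 1 + sum(term in d for d in docs)  # the page always contains it
        return tf[term] * math.log(n / df)

    # Highest score first; ties broken alphabetically for determinism.
    return sorted(tf, key=lambda t: (-score(t), t))[:k]


def discover(
    seed_pages: Dict[str, str],
    background: List[str],
    search: Callable[[str], Iterable[str]],   # hypothetical: query -> URLs
    fetch: Callable[[str], str],              # hypothetical: URL -> page text
    is_filtered: Callable[[str], bool],       # hypothetical: URL -> blocked?
    max_pages: int = 100,
) -> Set[str]:
    """Return newly discovered filtered URLs, starting from known ones."""
    tested: Set[str] = set(seed_pages)
    found: Set[str] = set()
    seen_queries: Set[str] = set()
    queue: Deque[Tuple[str, str]] = deque(seed_pages.items())
    processed = 0

    while queue and processed < max_pages:
        _, text = queue.popleft()
        processed += 1
        for term in top_terms(text, background):
            if term in seen_queries:
                continue  # never issue the same query twice
            seen_queries.add(term)
            for url in search(term):
                if url in tested:
                    continue
                tested.add(url)
                if is_filtered(url):
                    found.add(url)
                    # Feedback step: the new blocked page seeds more queries.
                    queue.append((url, fetch(url)))
    return found
```

In this sketch the queue of blocked pages is what makes the queries adaptive: each newly confirmed page contributes its own characteristic terms, so the search can drift away from the vocabulary of the original seed list. `max_pages` is an arbitrary cap added here to bound the loop.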

