FilteredWeb: A framework for the automated search-based discovery of blocked URLs

Alexander Darer, Oliver Farnan, Joss Wright
{"title":"FilteredWeb: A framework for the automated search-based discovery of blocked URLs","authors":"Alexander Darer, Oliver Farnan, Joss Wright","doi":"10.23919/TMA.2017.8002914","DOIUrl":null,"url":null,"abstract":"Various methods have been proposed for creating and maintaining lists of potentially filtered URLs to allow for measurement of ongoing internet censorship around the world. Whilst testing a known resource for evidence of filtering can be relatively simple, given appropriate vantage points, discovering previously unknown filtered web resources remains an open challenge. We present a novel framework for automating the process of discovering filtered resources through the use of adaptive queries to well-known search engines. Our system applies information retrieval algorithms to isolate characteristic linguistic patterns in known filtered web pages; these are used as the basis for web search queries. The resulting URLs of these searches are checked for evidence of filtering, and newly discovered blocked resources will be fed back into the system to detect further filtered content. Our implementation of this framework, applied to China as a case study, shows the approach is demonstrably effective at detecting significant numbers of previously unknown filtered web pages, making a significant contribution to the ongoing detection of internet filtering as it develops. When deployed, this system was used to discover 1355 poisoned domains within China as of Feb 2017 — 30 times more than in the most widely-used published filter list of the time. Of these, 759 are outside of the Alexa Top 1000 domains list, demonstrating the capability of this framework to find more obscure filtered content. Further, our initial analysis of filtered URLs, and the search terms that were used to discover them, gives further insight into the nature of the content currently being blocked in China.","PeriodicalId":118082,"journal":{"name":"2017 Network Traffic Measurement and Analysis Conference (TMA)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"11","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 Network Traffic Measurement and Analysis Conference (TMA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.23919/TMA.2017.8002914","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 11

Abstract

Various methods have been proposed for creating and maintaining lists of potentially filtered URLs to allow for measurement of ongoing internet censorship around the world. Whilst testing a known resource for evidence of filtering can be relatively simple, given appropriate vantage points, discovering previously unknown filtered web resources remains an open challenge. We present a novel framework for automating the process of discovering filtered resources through the use of adaptive queries to well-known search engines. Our system applies information retrieval algorithms to isolate characteristic linguistic patterns in known filtered web pages; these are used as the basis for web search queries. The resulting URLs of these searches are checked for evidence of filtering, and newly discovered blocked resources will be fed back into the system to detect further filtered content. Our implementation of this framework, applied to China as a case study, shows the approach is demonstrably effective at detecting significant numbers of previously unknown filtered web pages, making a significant contribution to the ongoing detection of internet filtering as it develops. When deployed, this system was used to discover 1355 poisoned domains within China as of Feb 2017 — 30 times more than in the most widely-used published filter list of the time. Of these, 759 are outside of the Alexa Top 1000 domains list, demonstrating the capability of this framework to find more obscure filtered content. Further, our initial analysis of filtered URLs, and the search terms that were used to discover them, gives further insight into the nature of the content currently being blocked in China.
FilteredWeb:一个基于自动搜索发现被阻止url的框架
人们提出了各种方法来创建和维护可能被过滤的url列表,以便对世界各地正在进行的互联网审查进行衡量。虽然测试已知资源的过滤证据相对简单,但给予适当的有利条件,发现以前未知的过滤web资源仍然是一个开放的挑战。我们提出了一个新的框架,通过使用对知名搜索引擎的自适应查询来自动化发现过滤资源的过程。我们的系统应用信息检索算法从已知的过滤网页中分离出特征语言模式;这些被用作网络搜索查询的基础。检查这些搜索的结果url是否有过滤的证据,并将新发现的被阻止的资源反馈到系统中,以检测进一步过滤的内容。我们对该框架的实施,应用于中国作为案例研究,表明该方法在检测大量以前未知的过滤网页方面明显有效,为不断发展的互联网过滤检测做出了重大贡献。部署后,截至2017年2月,该系统在中国境内发现了1355个有毒域名,比当时最广泛使用的发布过滤列表多30倍。其中,759是Alexa前1000域名列表之外,展示了这个框架的能力,以找到更多模糊的过滤内容。此外,我们对过滤网址的初步分析,以及用来发现它们的搜索词,让我们进一步了解目前在中国被屏蔽的内容的性质。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信