Automatic Generation of Web Censorship Probe Lists

Proceedings on Privacy Enhancing Technologies. Pub Date: 2024-07-11. DOI: 10.56553/popets-2024-0106

Jenny Tang, Léo Alvarez, Arjun Brar, Nguyen Phong Hoang, Nicolas Christin


Abstract

Domain probe lists, which determine which URLs are probed for Web censorship, play a critical role in Internet censorship measurement studies. Indeed, the size and accuracy of the domain probe list limit the set of censored pages that can be detected; inaccurate lists can lead to an incomplete view of the censorship landscape or biased results. Previous efforts to generate domain probe lists have been mostly manual or crowdsourced. This approach is time-consuming, prone to errors, and does not scale well to the ever-changing censorship landscape. In this paper, we explore methods for automatically generating probe lists that are both comprehensive and up-to-date for Web censorship measurement. We start from an initial set of 139,957 unique URLs, drawn from various existing test lists and covering pages in a variety of languages, to generate new candidate pages. By analyzing content from these URLs (i.e., performing topic and keyword extraction), expanding these topics, and using them as a feed to search engines, our method produces 119,255 new URLs across 35,147 domains. We then test the new candidate pages by attempting to access each URL from servers in eleven different global locations over a span of four months to check for their connectivity and potential signs of censorship. Our measurements reveal that our method discovered over 1,400 domains, not present in the original dataset, that we suspect to be blocked. In short, automatically updating probe lists is possible, and can help further automate censorship measurements at scale.
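The abstract counts new URLs by domain and reports domains "not present in the original dataset". As a rough illustration of that bookkeeping step only, the sketch below compares hostnames in a candidate list against those in a seed list. It is not the authors' implementation: the function names `hostname` and `new_domains`, the example URLs, and the choice to compare full hostnames (minus a leading `www.`) rather than registered domains are all assumptions made for this example.

```python
from urllib.parse import urlsplit


def hostname(url: str) -> str:
    """Return the lowercased hostname of a URL, without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def new_domains(seed_urls, candidate_urls):
    """Hostnames that appear among the candidates but not in the seed list."""
    seed_hosts = {hostname(u) for u in seed_urls}
    candidate_hosts = {hostname(u) for u in candidate_urls}
    # Drop the empty string produced by URLs that have no parseable host.
    return sorted(candidate_hosts - seed_hosts - {""})


# Hypothetical data, for illustration only.
seed = ["https://www.example.org/news", "http://example.net/"]
candidates = [
    "https://example.org/blog/post-1",   # host already in the seed list
    "https://forum.example.com/t/42",    # new host
    "https://EXAMPLE.com/about",         # new host, mixed case
]
print(new_domains(seed, candidates))  # ['example.com', 'forum.example.com']
```

Comparing hostnames keeps the sketch dependency-free. Grouping by registered domain (so that `forum.example.com` and `example.com` count once) would need a public-suffix list, which is left out here.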


