Identifying Sensitive URLs at Web-Scale

Proceedings of the ACM Internet Measurement Conference. Pub Date: 2020-10-27. DOI: 10.1145/3419394.3423653

S. Matic, Costas Iordanou, Georgios Smaragdakis, Nikolaos Laoutaris


Citations: 15

Abstract

Several data protection laws include special provisions for protecting personal data relating to religion, health, sexual orientation, and other sensitive categories. Having a well-defined list of sensitive categories is sufficient for filing complaints manually, conducting investigations, and prosecuting cases in courts of law. Data protection laws, however, do not explicitly define what type of content falls under each sensitive category. Therefore, it is unclear how to implement proactive measures such as informing users, blocking trackers, and filing complaints automatically when users visit sensitive domains. To enable such use cases, we turn to the Curlie.org crowdsourced taxonomy project to draw training data for building a text classifier for sensitive URLs. We demonstrate that our classifier can identify sensitive URLs with accuracy above 88%, and even recognize specific sensitive categories with accuracy above 90%. We then use our classifier to search for sensitive URLs in a corpus of 1 billion URLs collected by the Common Crawl project. We identify more than 155 million sensitive URLs in more than 4 million domains. Despite their sensitive nature, more than 30% of these URLs belong to domains that fail to use HTTPS. Also, in sensitive web pages with third-party cookies, 87% of the third parties set at least one persistent cookie.
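The abstract does not say which model or features the classifier uses. As a rough illustration of what a text classifier over sensitive categories can look like, the sketch below assumes a multinomial Naive Bayes model with add-one smoothing. This is a stand-in technique chosen for brevity, not necessarily the authors' method. The training texts, the labels, and the names `TRAIN`, `tokenize`, and `NaiveBayes` are all hypothetical and are not drawn from Curlie or from the paper.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy training data: (page text, category label).
# Invented for illustration; not taken from Curlie or the paper.
TRAIN = [
    ("clinic treatment symptoms diagnosis patient therapy", "health"),
    ("hospital doctor medication disease patient care", "health"),
    ("church prayer worship faith scripture congregation", "religion"),
    ("mosque temple faith prayer pilgrimage worship", "religion"),
    ("football league scores match season team", "non-sensitive"),
    ("recipe oven flour butter sugar baking", "non-sensitive"),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, samples):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()
        for text, label in samples:
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = NaiveBayes().fit(TRAIN)
print(model.predict("patient needs therapy and medication"))
```

A classifier at the scale described in the abstract would need far more training data and a proper evaluation split; the toy above only shows the train-then-predict shape of the task.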

