Precise Detection of Content Reuse in the Web

Calvin Ardi, J. Heidemann
{"title":"Web中内容重用的精确检测","authors":"Calvin Ardi, J. Heidemann","doi":"10.1145/3336937.3336940","DOIUrl":null,"url":null,"abstract":"With vast amount of content online, it is not surprising that unscrupulous entities \"borrow\" from the web to provide content for advertisements, link farms, and spam. Our insight is that cryptographic hashing and fingerprinting can efficiently identify content reuse for web-size corpora. We develop two related algorithms, one to automatically *discover* previously unknown duplicate content in the web, and the second to *precisely detect* copies of discovered or manually identified content. We show that *bad neighborhoods*, clusters of pages where copied content is frequent, help identify copying in the web. We verify our algorithm and its choices with controlled experiments over three web datasets: Common Crawl (2009/10), GeoCities (1990s–2000s), and a phishing corpus (2014). We show that our use of cryptographic hashing is much more precise than alternatives such as locality-sensitive hashing, avoiding the thousands of false-positives that would otherwise occur. We apply our approach in three systems: discovering and detecting duplicated content in the web, searching explicitly for copies of Wikipedia in the web, and detecting phishing sites in a web browser. We show that general copying in the web is often benign (for example, templates), but 6–11% are commercial or possibly commercial. Most copies of Wikipedia (86%) are commercialized (link farming or advertisements). For phishing, we focus on PayPal, detecting 59% of PayPal-phish even without taking on intentional cloaking.","PeriodicalId":403234,"journal":{"name":"Comput. Commun. Rev.","volume":"48 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Precise Detection of Content Reuse in the Web\",\"authors\":\"Calvin Ardi, J. Heidemann\",\"doi\":\"10.1145/3336937.3336940\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With vast amount of content online, it is not surprising that unscrupulous entities \\\"borrow\\\" from the web to provide content for advertisements, link farms, and spam. Our insight is that cryptographic hashing and fingerprinting can efficiently identify content reuse for web-size corpora. We develop two related algorithms, one to automatically *discover* previously unknown duplicate content in the web, and the second to *precisely detect* copies of discovered or manually identified content. We show that *bad neighborhoods*, clusters of pages where copied content is frequent, help identify copying in the web. We verify our algorithm and its choices with controlled experiments over three web datasets: Common Crawl (2009/10), GeoCities (1990s–2000s), and a phishing corpus (2014). We show that our use of cryptographic hashing is much more precise than alternatives such as locality-sensitive hashing, avoiding the thousands of false-positives that would otherwise occur. We apply our approach in three systems: discovering and detecting duplicated content in the web, searching explicitly for copies of Wikipedia in the web, and detecting phishing sites in a web browser. We show that general copying in the web is often benign (for example, templates), but 6–11% are commercial or possibly commercial. Most copies of Wikipedia (86%) are commercialized (link farming or advertisements). 
For phishing, we focus on PayPal, detecting 59% of PayPal-phish even without taking on intentional cloaking.\",\"PeriodicalId\":403234,\"journal\":{\"name\":\"Comput. Commun. Rev.\",\"volume\":\"48 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-05-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Comput. Commun. Rev.\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3336937.3336940\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Comput. Commun. Rev.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3336937.3336940","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 4

Abstract

With a vast amount of content online, it is not surprising that unscrupulous entities "borrow" from the web to provide content for advertisements, link farms, and spam. Our insight is that cryptographic hashing and fingerprinting can efficiently identify content reuse for web-size corpora. We develop two related algorithms, one to automatically *discover* previously unknown duplicate content in the web, and the second to *precisely detect* copies of discovered or manually identified content. We show that *bad neighborhoods*, clusters of pages where copied content is frequent, help identify copying in the web. We verify our algorithm and its choices with controlled experiments over three web datasets: Common Crawl (2009/10), GeoCities (1990s–2000s), and a phishing corpus (2014). We show that our use of cryptographic hashing is much more precise than alternatives such as locality-sensitive hashing, avoiding the thousands of false positives that would otherwise occur. We apply our approach in three systems: discovering and detecting duplicated content in the web, searching explicitly for copies of Wikipedia in the web, and detecting phishing sites in a web browser. We show that general copying in the web is often benign (for example, templates), but 6–11% are commercial or possibly commercial. Most copies of Wikipedia (86%) are commercialized (link farming or advertisements). For phishing, we focus on PayPal, detecting 59% of PayPal-phish even without taking on intentional cloaking.
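
The abstract describes the approach only at a high level. As a rough, illustrative sketch of the core idea (split pages into chunks, take a cryptographic hash of each chunk, and flag candidate pages whose chunks match a pre-built index of known content), a minimal Python version might look like the snippet below. The paragraph-level chunking, the choice of SHA-256, the 50% match threshold, and all function names are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch only -- not the authors' implementation.
import hashlib


def chunk(text: str) -> list[str]:
    """Split a page into paragraph-level chunks, normalizing whitespace."""
    return [" ".join(p.split()) for p in text.split("\n\n") if p.strip()]


def digest(s: str) -> str:
    """Cryptographic hash of a chunk: equal digests imply identical
    (normalized) chunk text for all practical purposes."""
    return hashlib.sha256(s.encode("utf-8")).hexdigest()


def build_index(original_pages: dict[str, str]) -> dict[str, str]:
    """Map chunk digest -> source URL for the content whose reuse we want
    to find (e.g., Wikipedia articles)."""
    index: dict[str, str] = {}
    for url, text in original_pages.items():
        for c in chunk(text):
            index[digest(c)] = url
    return index


def detect_reuse(page_text: str, index: dict[str, str],
                 threshold: float = 0.5) -> tuple[bool, float]:
    """Flag a candidate page when the fraction of its chunks found in the
    index reaches `threshold` (the threshold value is an assumption)."""
    chunks = chunk(page_text)
    if not chunks:
        return False, 0.0
    hits = sum(1 for c in chunks if digest(c) in index)
    ratio = hits / len(chunks)
    return ratio >= threshold, ratio


if __name__ == "__main__":
    # Hypothetical "original" content to protect.
    originals = {
        "https://en.wikipedia.org/wiki/Example": "Example text.\n\nMore example text.",
    }
    idx = build_index(originals)
    copied, ratio = detect_reuse("Example text.\n\nAn unrelated paragraph.",
                                 idx, threshold=0.4)
    print(copied, round(ratio, 2))  # True 0.5
```

Because a cryptographic hash matches only when the normalized chunk text is byte-identical, this style of lookup avoids the near-match false positives that locality-sensitive hashing can produce, which is the trade-off the abstract highlights.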