FP-Crawlers: Studying the Resilience of Browser Fingerprinting to Block Crawlers

Antoine Vastel, Walter Rudametkin, Romain Rouvoy, Xavier Blanc
Proceedings 2020 Workshop on Measurements, Attacks, and Defenses for the Web, published 2020-02-23. DOI: https://doi.org/10.14722/madweb.2020.23010
Citations: 41

Abstract

Data available on the Web, such as financial data or public reviews, provides a competitive advantage to companies able to exploit it. Web crawlers, a category of bot, automate the collection of publicly available Web data. While some crawlers collect data with the agreement of the websites being crawled, most crawlers do not respect the terms of service. CAPTCHAs and approaches based on analyzing series of HTTP requests classify users as humans or bots. However, these approaches require either user interaction or a significant volume of data before they can classify the traffic.

In this paper, we study browser fingerprinting as a crawler detection mechanism. We crawled the Alexa top 10K and identified 291 websites that block crawlers. We show that fingerprinting is used by 93 (31.96%) of them, and we report on the crawler detection techniques implemented by the major fingerprinters. Finally, we evaluate the resilience of fingerprinting against crawlers trying to conceal themselves. We show that although fingerprinting is good at detecting crawlers, it can be bypassed with little effort by an adversary with knowledge of the fingerprints collected.
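To make the idea of fingerprint-based crawler detection concrete, here is a minimal sketch in Python. It is not the detection logic of any fingerprinter studied in the paper. The dictionary keys (`user_agent`, `webdriver`, `languages`) and the function name `crawler_signals` are hypothetical, and the three checks are common heuristics chosen as assumptions for illustration: the `HeadlessChrome` token in the default headless Chrome user agent, the `navigator.webdriver` flag, and an empty language list.

```python
from typing import Any, Dict, List


def crawler_signals(fingerprint: Dict[str, Any]) -> List[str]:
    """Return the reasons a fingerprint looks automated (empty list if none).

    The attribute names are hypothetical; a real fingerprinting script
    collects many more attributes in the browser and sends them to a server.
    """
    signals: List[str] = []
    # Headless Chrome advertises itself in its default user agent string.
    if "HeadlessChrome" in fingerprint.get("user_agent", ""):
        signals.append("headless user agent")
    # navigator.webdriver is true in browsers controlled through WebDriver.
    if fingerprint.get("webdriver") is True:
        signals.append("navigator.webdriver is true")
    # Assumed heuristic: regular desktop browsers report at least one language.
    if not fingerprint.get("languages"):
        signals.append("no languages reported")
    return signals


# A crawler that makes no attempt to hide (illustrative values).
naive_crawler = {
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64) HeadlessChrome/79.0",
    "webdriver": True,
    "languages": [],
}

# The same crawler after overriding the attributes the checks rely on.
concealed_crawler = {
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64) Chrome/79.0",
    "webdriver": False,
    "languages": ["en-US", "en"],
}

print(crawler_signals(naive_crawler))
print(crawler_signals(concealed_crawler))
```

The second call returns an empty list, which mirrors the abstract's conclusion in miniature: once an adversary knows which attributes are collected, overriding them is enough to pass checks of this kind.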