Mitigating Web Scrapers using Markup Randomization

2021 Palestinian International Conference on Information and Communication Technology (PICICT) | Pub Date: 2021-09-01 | DOI: 10.1109/PICICT53635.2021.00038

Noor Bolbol, T. Barhoom



Abstract

Web scraping is the technique of extracting desired data in an automated way by scanning the internal links and content of a website; this activity is usually performed by systematically programmed bots. This paper explains our proposed solution for protecting blog content from theft, and from being copied to other destinations, by mitigating scraping bots. The solution applies two steps at two levels. The first step, at the main blog page level, hinders crawler bots by adding extra empty article anchors among the real articles. The second step, at the article page level, adds a random number of empty and hidden spans with randomly generated text within the article's body. To assess the solution, we applied it to a local project developed in PHP with the Laravel framework and defined four criteria to measure its effectiveness. The results show that the change in file size before and after applying the solution has no noticeable effect, and that processing time increased by only a few milliseconds, which is still within the acceptable range. Using the HTML-similarity tool, we obtained very good results: the pages remain similar in style, with only slight changes in structure. Finally, to assess the effect on bots, we re-ran a scraper bot and obtained the expected results from the programmed middleware. These results show that the solution is feasible to adopt and use to protect blog content.
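The two levels described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' Laravel middleware: the function names, the `/articles/` URL pattern, the inline `display:none` style, and the limits on how many decoys are added are all assumptions made for illustration only.

```python
import random
import string


def random_token(rng, length=8):
    """Return a random lowercase string (hypothetical decoy text or slug)."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def add_decoy_anchors(links, rng, max_decoys=3):
    """Index-page level: mix empty, hidden anchors in among the real article links.

    The real links keep their relative order; only decoys are inserted.
    """
    out = list(links)
    for _ in range(rng.randint(1, max_decoys)):
        # Assumed markup: an anchor with no text, hidden via an inline style.
        decoy = f'<a href="/articles/{random_token(rng)}" style="display:none"></a>'
        out.insert(rng.randint(0, len(out)), decoy)
    return out


def add_hidden_spans(paragraphs, rng, max_spans=3):
    """Article-page level: insert hidden spans holding random text between words."""
    result = []
    for text in paragraphs:
        words = text.split()
        for _ in range(rng.randint(1, max_spans)):
            span = f'<span style="display:none">{random_token(rng)}</span>'
            words.insert(rng.randint(0, len(words)), span)
        result.append(" ".join(words))
    return result


if __name__ == "__main__":
    rng = random.Random()  # unseeded, so the markup differs on every run
    index = ['<a href="/articles/1">First post</a>',
             '<a href="/articles/2">Second post</a>']
    print("\n".join(add_decoy_anchors(index, rng)))
    print("\n".join(add_hidden_spans(["Web scraping extracts data automatically."], rng)))
```

Because the number and position of the decoys change on each request, a scraper that relies on fixed positions or on raw text extraction may pick up junk anchors and junk text, while a browser that honors the hidden style shows the reader only the original content.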

