An efficient method in pre-processing phase of mining suspicious web crawlers

2017 21st International Conference on System Theory, Control and Computing (ICSTCC). Pub Date: 2017-10-01. DOI: 10.1109/ICSTCC.2017.8107046

Miron Catalin, Aflori Cristian


Citations: 5

Abstract

Recent reports indicate that web crawlers (robots, bots) generate more than half of all web traffic on the Internet. Web robots can be good (used, for example, by search engines) or bad (used for bypassing security solutions, scraping, spamming, or hacking), but usually all of them consume bandwidth and can harm businesses that rely on web traffic or content. Separating human online traffic from bot activity is not an easy task: the constantly evolving range of attacks and the continuous optimization of bots pose a new set of challenges. Our proposal is the first step toward a larger automated solution that combines various intrusion detection system (IDS) methods and tools with mining algorithms. The final objective of that solution is the automated and effective detection of genuine bad-bot threats, so that appropriate security measures can be taken. The method proposes an automated flow that runs from capturing network traffic to extracting the input data for the mining algorithms (the pre-processing step), together with initial pattern detection and visualization aimed at identifying potential threats generated by suspicious web crawlers. The first results are encouraging and represent the initial phase of identifying potential threats from bad web robots.
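
The abstract does not specify the data format or the features used, so the following is only a minimal sketch of what such a pre-processing step might look like. It assumes the captured HTTP traffic has already been reduced to Combined Log Format lines (an assumption, not the authors' pipeline), and the function name `extract_features`, the grouping by IP address and user agent, and the chosen features are all hypothetical illustrations.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed input: Combined Log Format lines (not stated in the paper).
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def extract_features(lines):
    """Group requests by (IP, user agent) and derive simple per-client features.

    The feature set is hypothetical; it only illustrates the kind of input
    a mining algorithm could consume.
    """
    groups = defaultdict(list)
    for line in lines:
        m = LOG_RE.match(line)
        if m is None:
            continue  # skip lines that do not parse
        ts = datetime.strptime(m["ts"], "%d/%b/%Y:%H:%M:%S %z")
        groups[(m["ip"], m["agent"])].append(
            (ts, m["method"], m["path"], int(m["status"]))
        )

    features = {}
    for key, reqs in groups.items():
        reqs.sort(key=lambda r: r[0])  # order requests by timestamp
        n = len(reqs)
        span = (reqs[-1][0] - reqs[0][0]).total_seconds()
        features[key] = {
            "requests": n,
            "robots_txt": any(r[2] == "/robots.txt" for r in reqs),
            "head_ratio": sum(r[1] == "HEAD" for r in reqs) / n,
            "error_ratio": sum(400 <= r[3] < 500 for r in reqs) / n,
            # Mean gap between consecutive requests; undefined for one request.
            "mean_gap_s": span / (n - 1) if n > 1 else None,
        }
    return features

# Illustrative sample lines (made up for this sketch).
SAMPLE = [
    '203.0.113.7 - - [01/Oct/2017:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "ExampleBot/1.0"',
    '203.0.113.7 - - [01/Oct/2017:10:00:02 +0000] "GET /a HTTP/1.1" 404 0 "-" "ExampleBot/1.0"',
    '198.51.100.4 - - [01/Oct/2017:10:00:05 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]
features = extract_features(SAMPLE)
```

Each resulting row could then be passed to a clustering or classification step. Very regular request gaps, a high error ratio, or a `robots.txt` fetch are commonly used hints of automated clients, but whether the authors rely on them is not stated in the abstract.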
