An efficient method in pre-processing phase of mining suspicious web crawlers

Miron Catalin, Aflori Cristian
Published in: 2017 21st International Conference on System Theory, Control and Computing (ICSTCC), 2017-10-01
DOI: 10.1109/ICSTCC.2017.8107046 (https://doi.org/10.1109/ICSTCC.2017.8107046)
Citations: 5

Abstract

Reports from recent years indicate that web crawler (robot, bot) activity generates more than half of the web traffic on the Internet. Web robots can be good (used, for example, by search engines) or bad (used for bypassing security solutions, scraping, spamming, or hacking), but usually all of them consume bandwidth and can harm businesses that rely on web traffic or content. Separating human online traffic from bot activity is not an easy task. The constantly evolving range of attacks and the continuous optimization of bots pose a new set of challenges. Our proposal is the first step toward a larger automated solution that combines various intrusion detection system (IDS) methods and tools with mining algorithms. The final objective of the solution is the automated and effective detection of genuine bad-bot threats so that appropriate security measures can be taken. The method proposes an automated flow from capturing network traffic to extracting the input data for the mining algorithms (the pre-processing step), together with an initial pattern detection and visualization step aimed at identifying potential threats generated by suspicious web crawlers. The first results are encouraging and represent the initial phase of identifying potential threats from bad web robots.
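The abstract does not say how the traffic is captured or which features are passed to the mining algorithms, so the following is only an illustrative sketch of what such a pre-processing step might look like, not the authors' method. It assumes the input is a web server access log in Combined Log Format, groups requests by (IP address, user agent), and derives a few per-client features often used to separate crawlers from human visitors. The function names, the feature set, and the sample log lines are all hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime

# Combined Log Format is an assumed input; the paper does not specify one.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["ts"] = datetime.strptime(rec["ts"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    return rec


def extract_features(lines):
    """Group requests by (ip, user agent) and compute per-client features."""
    groups = defaultdict(list)
    for line in lines:
        rec = parse_line(line)
        if rec is not None:
            groups[(rec["ip"], rec["agent"])].append(rec)

    features = {}
    for key, recs in groups.items():
        recs.sort(key=lambda r: r["ts"])
        n = len(recs)
        # Seconds between consecutive requests from the same client.
        gaps = [(b["ts"] - a["ts"]).total_seconds()
                for a, b in zip(recs, recs[1:])]
        features[key] = {
            "requests": n,
            "robots_txt_hits": sum(r["path"] == "/robots.txt" for r in recs),
            "head_ratio": sum(r["method"] == "HEAD" for r in recs) / n,
            "error_ratio": sum(400 <= r["status"] < 500 for r in recs) / n,
            "mean_gap_s": sum(gaps) / len(gaps) if gaps else None,
        }
    return features


# Made-up sample lines using documentation-reserved IP ranges.
SAMPLE = [
    '203.0.113.7 - - [01/Oct/2017:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 68 "-" "ExampleBot/1.0"',
    '203.0.113.7 - - [01/Oct/2017:10:00:01 +0000] "GET /a HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '203.0.113.7 - - [01/Oct/2017:10:00:02 +0000] "HEAD /b HTTP/1.1" 404 0 "-" "ExampleBot/1.0"',
    '198.51.100.9 - - [01/Oct/2017:10:00:05 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    for client, feats in extract_features(SAMPLE).items():
        print(client, feats)
```

Each resulting feature dictionary could then serve as one input row for a clustering or classification algorithm. Grouping by IP address and user agent is a simplification: real traffic behind NAT or rotating proxies would need a more careful notion of a session.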