Performance Evaluation of Density-Based Clustering Methods for Categorizing Web Robot Sessions

Dilip Singh Sisodia, Namrata Verma
Published in: 2018 International Conference on Advanced Computation and Telecommunication (ICACAT), pp. 1-5
Publication date: 2018-12-01
DOI: 10.1109/ICACAT.2018.8933782 (https://doi.org/10.1109/ICACAT.2018.8933782)
Citation count: 1

Abstract

Web servers are flooded with HTTP requests generated by programmed web scripts (termed web robots or web crawlers). Detecting the traffic generated by automated scripts at the server end is essential for blocking it, or at least minimizing its impact on server resources and services. Web robot sessions are characterized by their navigational behavior and by the features extracted from them. Labeled sessions can be used to identify robots through supervised learning. However, because robots arrive at high frequency and change their behavior dynamically, session labeling is not always feasible. In the absence of labels for the generated sessions, unsupervised learning proves very useful for segregating human and robot sessions. In this paper, the effectiveness of different density-based algorithms is evaluated on user session data. The user sessions are clustered into four possible categories: human sessions, tentative human sessions, robot sessions, and tentative robot sessions. The experiments are performed on five labeled datasets of varying session length. Three of the most popular density-based algorithms are used for session clustering: density-based spatial clustering of applications with noise (DBSCAN), ordering points to identify the clustering structure (OPTICS), and density-based clustering (DENCLUE). The comparative performance of these algorithms is evaluated using supervised and unsupervised validation indexes, including the Rand, Jaccard, Silhouette, and Davies-Bouldin indexes.
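To make the pipeline in the abstract concrete, the following is a minimal sketch of density-based clustering followed by supervised validation. It is not the authors' implementation: the two-dimensional session features, the `eps` and `min_pts` values, and the toy labels are all hypothetical, and the DBSCAN routine is a simplified textbook version written with the Python standard library only. The Rand and Jaccard indexes are computed by pair counting, with noise points treated as one extra group for simplicity.

```python
from itertools import combinations
from math import dist

NOISE = -1

def dbscan(points, eps, min_pts):
    """Simplified DBSCAN: one cluster id per point, NOISE (-1) for outliers."""
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
        cluster += 1
    return labels

def pair_counts(truth, pred):
    """Count point pairs by agreement between true classes and clusters."""
    a = b = c = d = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = pred[i] == pred[j]
        if same_truth and same_pred:
            a += 1  # same class, same cluster
        elif same_truth:
            b += 1  # same class, different clusters
        elif same_pred:
            c += 1  # different classes, same cluster
        else:
            d += 1  # different classes, different clusters
    return a, b, c, d

def rand_index(truth, pred):
    a, b, c, d = pair_counts(truth, pred)
    return (a + d) / (a + b + c + d)

def jaccard_index(truth, pred):
    a, b, c, _ = pair_counts(truth, pred)
    return a / (a + b + c) if (a + b + c) else 1.0

# Hypothetical toy sessions: two made-up numeric features per session.
points = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.1),
          (9.0, 0.0)]
truth = ["human", "human", "human", "robot", "robot", "robot", "robot"]

labels = dbscan(points, eps=0.5, min_pts=3)
print("cluster labels:", labels)
print("Rand index:", rand_index(truth, labels))
print("Jaccard index:", jaccard_index(truth, labels))
```

Both indexes compare a clustering against known labels, which is why they count as supervised validation; the Silhouette and Davies-Bouldin indexes instead use only the distances between points and clusters and need no labels.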