Don’t Stop Believin’: A Unified Evaluation Approach for LLM Honeypots

IF 3.4 · Zone 3 (Computer Science) · JCR Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Access · Pub Date: 2024-10-02 · DOI: 10.1109/ACCESS.2024.3472460

Simon B. Weber;Marc Feger;Michael Pilgermann

IEEE Access, vol. 12, pp. 144579-144587 (Journal Article). Full text: https://ieeexplore.ieee.org/document/10703029/

Citations: 0

Abstract

The research area of honeypots is gaining new momentum, driven by advancements in large language models (LLMs). The chat-based applications of generative pretrained transformer (GPT) models seem ideal for the use as honeypot backends, especially in request-response protocols like Secure Shell (SSH). By leveraging LLMs, many challenges associated with traditional honeypots – such as high development costs, ease of exposure, and breakout risks – appear to be solved. While early studies have primarily focused on the potential of these models, our research investigates the current limitations of GPT-3.5 by analyzing three datasets of varying complexity. We conducted an expert annotation of over 1,400 request-response pairs, encompassing 230 different base commands. Our findings reveal that while GPT-3.5 struggles to maintain context, incorporating session context into response generation improves the quality of SSH responses. Additionally, we explored whether distinguishing between convincing and non-convincing responses is a metrics issue. We propose a paraphrase-mining approach to address this challenge, which achieved a macro F1 score of 77.85% using cosine distance in our evaluation. This method has the potential to reduce annotation efforts, converge LLM-based honeypot performance evaluation, and facilitate comparisons between new and previous approaches in future research.
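The evaluation idea in the abstract (score a generated SSH response by its cosine distance to a reference response, then measure agreement with expert labels via macro F1) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and not the authors' pipeline: the bag-of-tokens `embed` function stands in for whatever sentence-embedding model the paper uses, and the `threshold` value, function names, and example pairs are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence embedding (assumption): token counts."""
    return Counter(text.split())


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty response as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def predict_convincing(generated, reference, threshold=0.5):
    """Label a response convincing if it is close to the reference.

    The threshold of 0.5 is a hypothetical value, not one from the paper.
    """
    return cosine_distance(embed(generated), embed(reference)) <= threshold


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical (generated response, reference response, expert label) triples.
    pairs = [
        ("root", "root", True),
        ("/home/user", "/root", False),
        ("bash: foo: command not found", "bash: foo: command not found", True),
        ("I am an AI language model", "uid=0(root) gid=0(root)", False),
    ]
    labels = [label for _, _, label in pairs]
    preds = [predict_convincing(gen, ref) for gen, ref, _ in pairs]
    print("macro F1:", macro_f1(labels, preds))
```

Macro F1 averages the F1 of the convincing and non-convincing classes with equal weight, so a metric that simply predicts the majority class is penalized even when the annotated data are imbalanced.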




Source journal

IEEE Access (COMPUTER SCIENCE, INFORMATION SYSTEMS; ENGINEERING, ELECTRICAL & ELECTRONIC)

CiteScore: 9.80

Self-citation rate: 7.70%

Articles published: 6673

Review time: 6 weeks

About the journal: IEEE Access® is a multidisciplinary, open access (OA), applications-oriented, all-electronic archival journal that continuously presents the results of original research or development across all of IEEE's fields of interest. IEEE Access will publish articles that are of high interest to readers, original, technically correct, and clearly presented. Supported by author publication charges (APC), its hallmarks are a rapid peer review and publication process with open access to all readers. Unlike IEEE's traditional Transactions or Journals, reviews are "binary", in that reviewers will either Accept or Reject an article in the form it is submitted in order to achieve rapid turnaround. Especially encouraged are submissions on: Multidisciplinary topics, or applications-oriented articles and negative results that do not fit within the scope of IEEE's traditional journals. Practical articles discussing new experiments or measurement techniques, interesting solutions to engineering. Development of new or improved fabrication or manufacturing techniques. Reviews or survey articles of new or evolving fields oriented to assist others in understanding the new area.