PHP-based malicious webshell detection based on abstract syntax tree simplification and explicit duration recurrent networks

IF 4.8, Zone 2 (Computer Science), JCR Q1 (COMPUTER SCIENCE, INFORMATION SYSTEMS)

Computers & Security Pub Date : 2024-08-10 DOI:10.1016/j.cose.2024.104049

Citations: 0

Abstract

Malicious webshells are the most common attack scripts used by attackers in web penetration. Attackers typically obfuscate strings of PHP-based malicious webshells and encrypt communication traffic to bypass security devices. In this case, the opcode sequences of the PHP-based malicious webshells become excessively long and contain many irrelevant features, which affect the efficacy of the detection method. This study proposes a new PHP-based malicious webshell detection method. The proposed method introduces three simplification strategies for the three main types of nodes in the abstract syntax trees of PHP scripts to reduce the length and noise of opcode sequences of PHP-based malicious webshells. An explicit duration recurrent network (EDRN), a recurrent neural network based on an extended hidden semi-Markov model, is used to detect malicious webshells. Word2vec is adopted to convert the opcode sequences of the PHP scripts into vectors that serve as the input for the EDRN. Experiments were conducted using public datasets collected from GitHub. The experimental results indicated that EDRN outperformed popular recurrent neural networks. The proposed method demonstrated superior performance compared with several state-of-the-art approaches and mainstream tools, achieving an accuracy of 0.993, an F1 score of 0.990, and a recall rate of 0.991. When only 20% of the datasets were used for training, the proposed method achieved accuracy, recall, and F1 scores of 0.985, 0.983, and 0.980, respectively, significantly outperforming existing approaches.
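
The effect of simplifying the abstract syntax tree on opcode sequence length can be illustrated with a minimal sketch. The abstract does not describe the three simplification strategies, so everything below is an assumption made for illustration: the node classes (`Str`, `Concat`, `Call`), the single folding rule in `simplify`, and the opcode-like token names produced by `emit` are hypothetical and are not the authors' method or real Zend VM output. The sketch shows only the general idea that folding split-up string literals, a common obfuscation pattern, shortens the sequence a downstream model has to read.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Str:
    """String literal node (hypothetical node type)."""
    value: str


@dataclass
class Concat:
    """Binary string concatenation node, like PHP's `.` operator."""
    left: "Node"
    right: "Node"


@dataclass
class Call:
    """Function call node with positional arguments."""
    name: str
    args: List["Node"] = field(default_factory=list)


Node = Union[Str, Concat, Call]


def simplify(node: Node) -> Node:
    """Fold concatenations of string literals into one literal.

    Assumed rule for illustration only; the paper's actual
    strategies are not given in the abstract.
    """
    if isinstance(node, Concat):
        left, right = simplify(node.left), simplify(node.right)
        if isinstance(left, Str) and isinstance(right, Str):
            return Str(left.value + right.value)
        return Concat(left, right)
    if isinstance(node, Call):
        return Call(node.name, [simplify(arg) for arg in node.args])
    return node


def emit(node: Node) -> List[str]:
    """Emit a post-order, opcode-like token sequence (made-up token names)."""
    if isinstance(node, Str):
        return ["PUSH_CONST"]
    if isinstance(node, Concat):
        return emit(node.left) + emit(node.right) + ["CONCAT"]
    ops: List[str] = []
    for arg in node.args:
        ops += emit(arg) + ["SEND_VAL"]
    return ops + ["DO_FCALL"]


# A call whose argument is a literal split into three pieces.
tree = Call("base64_decode", [Concat(Concat(Str("aGVs"), Str("bG8")), Str("="))])

before = emit(tree)
after = emit(simplify(tree))
print(len(before), len(after))  # 7 3
```

In a pipeline of the kind the abstract outlines, the shortened token sequence would then be embedded (the paper uses Word2vec) and passed to the sequence classifier; those stages are not sketched here.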

Source journal

Computers & Security (Engineering & Technology: Computer Science, Information Systems)

CiteScore: 12.40

Self-citation rate: 7.10%

Articles published: 365

Review time: 10.7 months

About the journal: Computers & Security is the most respected technical journal in the IT security field. With its high-profile editorial board and informative regular features and columns, the journal is essential reading for IT security professionals around the world. Computers & Security provides you with a unique blend of leading-edge research and sound practical management advice. It is aimed at the professional involved with computer security, audit, control and data integrity in all sectors - industry, commerce and academia. Recognized worldwide as THE primary source of reference for applied research and technical expertise, it is your first step to fully secure systems.