Malsight: Exploring Malicious Source Code and Benign Pseudocode for Iterative Binary Malware Summarization

IF 8.0 | Zone 1 (Computer Science) | JCR Q1: Computer Science, Theory & Methods

IEEE Transactions on Information Forensics and Security. Pub Date: 2025-06-26. DOI: 10.1109/TIFS.2025.3583552

Haolang Lu;Hongrui Peng;Guoshun Nan;Jiaoyang Cui;Cheng Wang;Weifei Jin;Songtao Wang;Shengli Pan;Xiaofeng Tao

Volume 20, pages 6733-6747. Journal Article. https://ieeexplore.ieee.org/document/11052730/

Citations: 0

Abstract

Binary malware summarization aims to automatically generate human-readable descriptions of malware behaviors from executable files, facilitating tasks like malware cracking and detection. Previous methods based on Large Language Models (LLMs) have shown great promise. However, they still face significant issues, including poor usability, inaccurate explanations, and incomplete summaries, primarily due to the obscure pseudocode structure and the lack of malware training summaries. Further, calling relationships between functions, which involve the rich interactions within a binary malware, remain largely underexplored. To this end, we propose Malsight, a novel code summarization framework that can iteratively generate descriptions of binary malware by exploring malicious source code and benign pseudocode. Specifically, we construct the first malware summary dataset, MalS and MalP, using an LLM and manually refine this dataset with human effort. At the training stage, we tune our proposed MalT5, a novel LLM-based code model, on the MalS and benign pseudocode datasets. Then, at the test stage, we iteratively feed the pseudocode functions into MalT5 to obtain the summary. Such a procedure facilitates the understanding of pseudocode structure and captures the intricate interactions between functions, thereby benefiting summaries’ usability, accuracy, and completeness. Additionally, we propose a novel evaluation benchmark, BLEURT-sum, to measure the quality of summaries. Experiments on three datasets show the effectiveness of the proposed Malsight. Notably, our proposed MalT5, with only 0.77B parameters, delivers comparable performance to much larger Code-Llama.
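The iterative test-stage procedure described in the abstract can be pictured with a short sketch. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the traversal order or how callee summaries are passed to callers, so the bottom-up call-graph walk, the comment-style annotation, and every name below (`summarize_iteratively`, `toy_summarizer`, the `sub_...` functions) are hypothetical. The `summarize` callable is a placeholder where a model such as MalT5 would be invoked.

```python
from typing import Callable, Dict, List


def summarize_iteratively(
    pseudocode: Dict[str, str],
    calls: Dict[str, List[str]],
    summarize: Callable[[str], str],
) -> Dict[str, str]:
    """Summarize callees before callers, passing callee summaries upward.

    Assumption: a post-order walk of the call graph; the paper may differ.
    """
    summaries: Dict[str, str] = {}
    in_progress = set()

    def visit(name: str) -> None:
        if name in summaries or name in in_progress:
            return  # already summarized, or part of a recursive cycle
        in_progress.add(name)
        for callee in calls.get(name, []):
            if callee in pseudocode:  # skip external or unknown callees
                visit(callee)
        # Prepend the callee summaries that are already known as comments.
        notes = [
            f"// {callee}: {summaries[callee]}"
            for callee in calls.get(name, [])
            if callee in summaries
        ]
        summaries[name] = summarize("\n".join(notes + [pseudocode[name]]))
        in_progress.discard(name)

    for name in pseudocode:
        visit(name)
    return summaries


def toy_summarizer(text: str) -> str:
    """Hypothetical stand-in for a model call: reports the input size only."""
    return f"{len(text.splitlines())} lines"


# Hypothetical decompiler output and call graph, for illustration only.
functions = {
    "sub_401000": "int sub_401000() { return sub_401020() + sub_401040(); }",
    "sub_401020": "int sub_401020() { return 1; }",
    "sub_401040": "int sub_401040() { return sub_401020(); }",
}
call_graph = {
    "sub_401000": ["sub_401020", "sub_401040"],
    "sub_401040": ["sub_401020"],
}
result = summarize_iteratively(functions, call_graph, toy_summarizer)
```

In this sketch the leaf function is summarized first, and each caller sees its callees' summaries as comment lines above its own pseudocode, which is one simple way to expose calling relationships to a summarizer. Very deep call chains would need an explicit stack instead of recursion.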


Source journal

IEEE Transactions on Information Forensics and Security (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 14.40
Self-citation rate: 7.40%
Articles published: 234
Review time: 6.5 months

About the journal: The IEEE Transactions on Information Forensics and Security covers the sciences, technologies, and applications relating to information forensics, information security, biometrics, surveillance and systems applications that incorporate these features.