Malsight: Exploring Malicious Source Code and Benign Pseudocode for Iterative Binary Malware Summarization

IF 8 | Zone 1 (Computer Science) | JCR Q1 (COMPUTER SCIENCE, THEORY & METHODS)
Haolang Lu;Hongrui Peng;Guoshun Nan;Jiaoyang Cui;Cheng Wang;Weifei Jin;Songtao Wang;Shengli Pan;Xiaofeng Tao
DOI: 10.1109/TIFS.2025.3583552
Journal: IEEE Transactions on Information Forensics and Security, vol. 20, pp. 6733-6747
Publication date: 2025-06-26
Publication type: Journal Article
URL: https://ieeexplore.ieee.org/document/11052730/
Citations: 0

Abstract

Binary malware summarization aims to automatically generate human-readable descriptions of malware behaviors from executable files, facilitating tasks like malware cracking and detection. Previous methods based on Large Language Models (LLMs) have shown great promise. However, they still face significant issues, including poor usability, inaccurate explanations, and incomplete summaries, primarily due to the obscure pseudocode structure and the lack of malware training summaries. Further, calling relationships between functions, which capture the rich interactions within a malware binary, remain largely underexplored. To this end, we propose Malsight, a novel code summarization framework that iteratively generates descriptions of binary malware by exploring malicious source code and benign pseudocode. Specifically, we construct the first malware summary datasets, MalS and MalP, using an LLM and then refine them manually. At the training stage, we tune our proposed MalT5, a novel LLM-based code model, on the MalS and benign pseudocode datasets. Then, at the test stage, we iteratively feed the pseudocode functions into MalT5 to obtain the summary. Such a procedure facilitates the understanding of pseudocode structure and captures the intricate interactions between functions, thereby benefiting summaries’ usability, accuracy, and completeness. Additionally, we propose a novel evaluation benchmark, BLEURT-sum, to measure the quality of summaries. Experiments on three datasets show the effectiveness of the proposed Malsight. Notably, our proposed MalT5, with only 0.77B parameters, delivers performance comparable to that of the much larger Code-Llama.
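
The abstract does not spell out how the iterative test-stage procedure is organized. As a rough illustration only, the sketch below assumes one plausible reading: functions are summarized callees-first along the call graph, and each callee's summary is prepended as a comment to its caller's pseudocode. The names `summarize_iteratively` and `toy_summarize`, the prompt layout, and the sample function names are all hypothetical; the stub stands in for a real model such as MalT5 and is not the authors' implementation.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List


def summarize_iteratively(
    pseudocode: Dict[str, str],
    calls: Dict[str, List[str]],
    summarize: Callable[[str], str],
) -> Dict[str, str]:
    """Summarize functions callees-first, passing callee summaries to callers.

    Assumption: the call graph is acyclic. graphlib raises CycleError on
    recursive call chains, which a real pipeline would have to handle.
    """
    # Keep only call edges whose target has pseudocode available.
    graph = {
        name: [c for c in calls.get(name, []) if c in pseudocode]
        for name in pseudocode
    }
    # TopologicalSorter expects node -> predecessors; treating callees as
    # predecessors yields every callee before the functions that call it.
    order = TopologicalSorter(graph).static_order()

    summaries: Dict[str, str] = {}
    for name in order:
        # Hypothetical prompt layout: callee summaries as leading comments.
        notes = [f"// {callee}: {summaries[callee]}" for callee in graph[name]]
        prompt = "\n".join(notes + [pseudocode[name]])
        summaries[name] = summarize(prompt)
    return summaries


def toy_summarize(prompt: str) -> str:
    # Stand-in for a model call (hypothetical): reports the prompt's line count.
    return f"{len(prompt.splitlines())} line(s) of context"


if __name__ == "__main__":
    code = {
        "main": "int main() { sub_401000(); }",
        "sub_401000": "void sub_401000() { /* ... */ }",
    }
    result = summarize_iteratively(code, {"main": ["sub_401000"]}, toy_summarize)
    for name, text in result.items():
        print(name, "->", text)
```

Processing callees first means that the caller's prompt already carries a natural-language description of each function it invokes, which is one way the calling relationships mentioned in the abstract could inform the final summary.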
Source journal
IEEE Transactions on Information Forensics and Security (Engineering & Technology - Engineering: Electrical & Electronic)
CiteScore: 14.40
Self-citation rate: 7.40%
Articles published: 234
Review time: 6.5 months
Journal description: The IEEE Transactions on Information Forensics and Security covers the sciences, technologies, and applications relating to information forensics, information security, biometrics, surveillance, and systems applications that incorporate these features.