Enhancing Protocol Fuzzing via Diverse Seed Corpus Generation

IF 5.6 | Zone 1: Computer Science | JCR Q1: COMPUTER SCIENCE, SOFTWARE ENGINEERING

IEEE Transactions on Software Engineering | Pub Date: 2025-08-04 | DOI: 10.1109/TSE.2025.3595396

Zhengxiong Luo;Qingpeng Du;Yujue Wang;Abhik Roychoudhury;Yu Jiang

Volume 51, Issue 9, pages 2693-2709 | https://ieeexplore.ieee.org/document/11108709/

Citations: 0

Abstract


Enhancing Protocol Fuzzing via Diverse Seed Corpus Generation

Protocol fuzzing is an effective technique for discovering vulnerabilities in protocol implementations. Although much progress has been made in optimizing input mutation, the initial seed inputs, which serve as the starting point for fuzzing, are still a critical factor in determining the effectiveness of subsequent fuzzing. Existing methods for seed corpus preparation mainly rely on captured network traffic, which suffers from limited diversity due to the biased message distributions present in real-world traffic. Protocol specifications encompass detailed information on diverse messages and thus provide a more comprehensive way for seed corpus preparation. However, these specifications are voluminous and not directly machine-readable. To address this challenge, we introduce PSG, which enhances protocol fuzzing by leveraging large language models (LLMs) to analyze protocol specifications for generating a high-quality seed corpus. First, PSG systematically reorganizes the protocol specification metadata into a structured knowledge base for effective LLM augmentation. Then, PSG employs a grammar-free method to generate target protocol messages and incorporates an iterative refinement process for better accuracy and efficiency. Our evaluation on 7 widely-used protocols and 13 implementations demonstrates that PSG can effectively generate diverse, protocol-compliant message inputs. Moreover, the generated seed corpus significantly improves the performance of state-of-the-art black-box and grey-box protocol fuzzers, achieving higher branch coverage and discovering more zero-day bugs.
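The abstract describes generation with an iterative refinement process but gives no implementation detail. As a rough illustration only, the Python sketch below shows one way a generate, validate, and retry loop could be organized. Everything in it is an assumption made for illustration and is not taken from PSG: the toy line-based protocol, the command set, the `validate` rules, and `stub_generator`, which stands in for an LLM call and simply fixes its output once it receives feedback.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical toy protocol (not from the paper): a message is
# b"<CMD> <arg>\r\n", where CMD must be one of the commands below.
KNOWN_COMMANDS = {"USER", "PASS", "LIST", "QUIT"}


def validate(message: bytes) -> Optional[str]:
    """Return None if the message is well-formed, else a short error text."""
    if not message.endswith(b"\r\n"):
        return "message must end with CRLF"
    parts = message[:-2].split(b" ", 1)
    command = parts[0].decode("ascii", errors="replace")
    if command not in KNOWN_COMMANDS:
        return f"unknown command {command!r}"
    return None


def refine_seed(
    generate: Callable[[str, Optional[str]], bytes],
    message_type: str,
    max_rounds: int = 3,
) -> Tuple[Optional[bytes], int]:
    """Request a message, feeding validation errors back until it passes.

    Returns (seed, rounds_used); seed is None if no attempt validated.
    """
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        candidate = generate(message_type, feedback)
        feedback = validate(candidate)
        if feedback is None:
            return candidate, round_no
    return None, max_rounds


def stub_generator(message_type: str, feedback: Optional[str]) -> bytes:
    """Stand-in for an LLM call: the first attempt omits CRLF, retries add it."""
    body = f"{message_type} demo".encode("ascii")
    return body + b"\r\n" if feedback else body


def build_corpus(message_types: List[str]) -> List[bytes]:
    """Collect one validated seed per message type, skipping failures."""
    corpus: List[bytes] = []
    for message_type in message_types:
        seed, _ = refine_seed(stub_generator, message_type)
        if seed is not None:
            corpus.append(seed)
    return corpus


if __name__ == "__main__":
    # "NOPE" is not a known command, so it never validates and is dropped.
    print(build_corpus(["USER", "LIST", "NOPE"]))
```

In this sketch the validator's error text is the only feedback channel; a real system would presumably draw on the protocol specification for both generation and checking, which the toy rules here do not attempt to model.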


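Source journal

IEEE Transactions on Software Engineering (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 9.70

Self-citation rate: 10.80%

Articles published: 724

Review time: 6 months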

Journal description: IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of this Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include:

a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models.
b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects.
c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards.
d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues.
e) System issues: Hardware-software trade-offs.
f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.