SecureFalcon: Are We There Yet in Automated Software Vulnerability Detection With LLMs?

IF 6.5 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

IEEE Transactions on Software Engineering Pub Date : 2025-03-05 DOI:10.1109/TSE.2025.3548168

Mohamed Amine Ferrag;Ammar Battah;Norbert Tihanyi;Ridhi Jain;Diana Maimuţ;Fatima Alwahedi;Thierry Lestable;Narinderjit Singh Thandi;Abdechakour Mechri;Merouane Debbah;Lucas C. Cordeiro

IEEE Transactions on Software Engineering, vol. 51, no. 4, pp. 1248-1265. https://ieeexplore.ieee.org/document/10910240/

Citations: 0

Abstract

Software vulnerabilities can cause numerous problems, including crashes, data loss, and security breaches. These issues greatly compromise quality and can negatively impact the market adoption of software applications and systems. Traditional bug-fixing methods, such as static analysis, often produce false positives. While bounded model checking, a form of Formal Verification (FV), can provide more accurate outcomes compared to static analyzers, it demands substantial resources and significantly hinders developer productivity. Can Machine Learning (ML) achieve accuracy comparable to FV methods and be used in popular instant code completion frameworks in near real-time? In this paper, we introduce SecureFalcon, an innovative model architecture with only 121 million parameters derived from the Falcon-40B model and explicitly tailored for classifying software vulnerabilities. To achieve the best performance, we trained our model using two datasets, namely the FormAI dataset and FalconVulnDB. FalconVulnDB is a combination of recent public datasets, namely the SySeVR framework, Draper VDISC, Big-Vul, DiverseVul, SARD Juliet, and ReVeal datasets. These datasets contain the top 25 most dangerous software weaknesses, such as CWE-119, CWE-120, CWE-476, CWE-122, CWE-190, CWE-121, CWE-78, CWE-787, CWE-20, and CWE-762. SecureFalcon achieves 94% accuracy in binary classification and up to 92% in multiclass classification, with instant CPU inference times. It outperforms existing models such as BERT, RoBERTa, CodeBERT, and traditional ML algorithms, promising to push the boundaries of software vulnerability detection and instant code completion frameworks.
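The abstract reports accuracy in two settings: binary (vulnerable or not) and multiclass (which CWE). The following is a minimal sketch of how those two scores relate, not the authors' evaluation code. The label name `NOT_VULNERABLE`, the helper functions, and the toy labels are all assumptions made for illustration; the numbers it computes are unrelated to the results reported in the paper.

```python
from typing import List, Sequence

# Hypothetical label for samples without a weakness (an assumption, not from the paper).
NOT_VULNERABLE = "NOT_VULNERABLE"


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of samples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def to_binary(labels: Sequence[str]) -> List[str]:
    """Collapse every CWE label into a single VULNERABLE class."""
    return [NOT_VULNERABLE if lab == NOT_VULNERABLE else "VULNERABLE" for lab in labels]


# Toy labels, invented purely for illustration.
y_true = ["CWE-476", "CWE-787", NOT_VULNERABLE, "CWE-190", NOT_VULNERABLE]
y_pred = ["CWE-476", "CWE-119", NOT_VULNERABLE, "CWE-190", "CWE-20"]

multiclass_acc = accuracy(y_true, y_pred)
binary_acc = accuracy(to_binary(y_true), to_binary(y_pred))

print(f"multiclass accuracy: {multiclass_acc:.2f}")
print(f"binary accuracy:     {binary_acc:.2f}")
```

In this toy data, predicting CWE-119 for a CWE-787 sample counts as an error in the multiclass setting but as a correct "vulnerable" verdict in the binary setting, which is one reason a binary score can be higher than a multiclass score on the same predictions.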

Source journal

IEEE Transactions on Software Engineering (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore

9.70

Self-citation rate

10.80%

Articles published

724

Review time

6 months

Journal description: IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of this Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include: a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models. b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects. c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards. d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues. e) System issues: Hardware-software trade-offs. f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.