{"title":"Detection Made Easy: Potentials of Large Language Models for Solidity Vulnerabilities","authors":"Md Tauseef Alam, Raju Halder, Abyayananda Maiti","doi":"arxiv-2409.10574","DOIUrl":null,"url":null,"abstract":"The large-scale deployment of Solidity smart contracts on the Ethereum\nmainnet has increasingly attracted financially-motivated attackers in recent\nyears. A few now-infamous attacks in Ethereum's history includes DAO attack in\n2016 (50 million dollars lost), Parity Wallet hack in 2017 (146 million dollars\nlocked), Beautychain's token BEC in 2018 (900 million dollars market value fell\nto 0), and NFT gaming blockchain breach in 2022 ($600 million in Ether stolen).\nThis paper presents a comprehensive investigation of the use of large language\nmodels (LLMs) and their capabilities in detecting OWASP Top Ten vulnerabilities\nin Solidity. We introduce a novel, class-balanced, structured, and labeled\ndataset named VulSmart, which we use to benchmark and compare the performance\nof open-source LLMs such as CodeLlama, Llama2, CodeT5 and Falcon, alongside\nclosed-source models like GPT-3.5 Turbo and GPT-4o Mini. Our proposed SmartVD\nframework is rigorously tested against these models through extensive automated\nand manual evaluations, utilizing BLEU and ROUGE metrics to assess the\neffectiveness of vulnerability detection in smart contracts. We also explore\nthree distinct prompting strategies-zero-shot, few-shot, and\nchain-of-thought-to evaluate the multi-class classification and generative\ncapabilities of the SmartVD framework. Our findings reveal that SmartVD\noutperforms its open-source counterparts and even exceeds the performance of\nclosed-source base models like GPT-3.5 and GPT-4 Mini. After fine-tuning, the\nclosed-source models, GPT-3.5 Turbo and GPT-4o Mini, achieved remarkable\nperformance with 99% accuracy in detecting vulnerabilities, 94% in identifying\ntheir types, and 98% in determining severity. Notably, SmartVD performs best\nwith the `chain-of-thought' prompting technique, whereas the fine-tuned\nclosed-source models excel with the `zero-shot' prompting approach.","PeriodicalId":501168,"journal":{"name":"arXiv - CS - Emerging Technologies","volume":"41 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Emerging Technologies","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.10574","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
The large-scale deployment of Solidity smart contracts on the Ethereum mainnet has increasingly attracted financially motivated attackers in recent years. Now-infamous attacks in Ethereum's history include the DAO attack in 2016 ($50 million lost), the Parity Wallet hack in 2017 ($146 million locked), the collapse of Beautychain's BEC token in 2018 ($900 million in market value wiped out), and the NFT gaming blockchain breach in 2022 ($600 million in Ether stolen).
This paper presents a comprehensive investigation of the capabilities of large language models (LLMs) in detecting OWASP Top Ten vulnerabilities in Solidity. We introduce VulSmart, a novel class-balanced, structured, and labeled dataset, which we use to benchmark and compare the performance of open-source LLMs such as CodeLlama, Llama2, CodeT5, and Falcon alongside closed-source models like GPT-3.5 Turbo and GPT-4o Mini. Our proposed SmartVD framework is rigorously tested against these models through extensive automated and manual evaluations, using BLEU and ROUGE metrics to assess the effectiveness of vulnerability detection in smart contracts.
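As a rough illustration of the automated evaluation step, the following Python sketch scores a model-generated vulnerability report against a reference annotation with BLEU and ROUGE. It assumes the nltk and rouge-score packages are installed; the example strings are invented for illustration and are not the paper's actual data.

```python
# Hedged sketch: BLEU/ROUGE scoring of a generated vulnerability report.
# Assumes the nltk and rouge-score packages; example texts are illustrative.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "reentrancy vulnerability in withdraw function severity high"
candidate = "the withdraw function is vulnerable to reentrancy severity high"

# BLEU compares n-gram overlap; smoothing avoids zero scores on short texts.
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-1 counts unigram overlap; ROUGE-L uses the longest common subsequence.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.3f}")
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```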
We also explore three distinct prompting strategies (zero-shot, few-shot, and chain-of-thought) to evaluate the multi-class classification and generative capabilities of the SmartVD framework; illustrative templates for the three strategies are sketched below. Our findings reveal that SmartVD outperforms its open-source counterparts and even exceeds the performance of the closed-source base models GPT-3.5 Turbo and GPT-4o Mini.
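The three prompting strategies differ only in how the contract under test is framed for the model. A minimal sketch of what such templates might look like; the wording and the build_prompt helper are our own assumptions, not the paper's actual prompts:

```python
# Hedged sketch: the three prompting styles as plain string templates.
# Template wording and helper names are illustrative assumptions.
ZERO_SHOT = (
    "Is the following Solidity contract vulnerable? If so, name the "
    "vulnerability type and severity.\n\n{code}"
)

FEW_SHOT = (
    "Example contract:\n{example_code}\n"
    "Answer: vulnerable, type=reentrancy, severity=high\n\n"
    "Now analyze this contract:\n{code}"
)

CHAIN_OF_THOUGHT = (
    "Analyze the following Solidity contract step by step: (1) trace external "
    "calls, (2) check state updates around them, (3) then conclude whether it "
    "is vulnerable, its type, and its severity.\n\n{code}"
)

def build_prompt(strategy: str, code: str, example_code: str = "") -> str:
    """Fill the chosen template with the contract source under test."""
    templates = {
        "zero_shot": ZERO_SHOT,
        "few_shot": FEW_SHOT,           # callers should supply example_code
        "chain_of_thought": CHAIN_OF_THOUGHT,
    }
    return templates[strategy].format(code=code, example_code=example_code)
```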
After fine-tuning, the closed-source models GPT-3.5 Turbo and GPT-4o Mini achieve remarkable performance: 99% accuracy in detecting vulnerabilities, 94% in identifying vulnerability types, and 98% in determining severity. Notably, SmartVD performs best with the "chain-of-thought" prompting technique, whereas the fine-tuned closed-source models excel with the "zero-shot" approach.
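The three reported figures correspond to three separate classification accuracies. A hedged sketch of how they could be computed from gold labels and model predictions, using scikit-learn and invented example data:

```python
# Hedged sketch: the three accuracies (detection, type, severity) computed
# from gold labels and predictions. The label lists below are made up.
from sklearn.metrics import accuracy_score

gold_detect = [1, 1, 0, 1]          # 1 = vulnerable, 0 = safe
pred_detect = [1, 1, 0, 1]

gold_type = ["reentrancy", "tx-origin", "none", "overflow"]
pred_type = ["reentrancy", "tx-origin", "none", "underflow"]

gold_sev = ["high", "medium", "none", "high"]
pred_sev = ["high", "medium", "none", "high"]

print("detection accuracy:", accuracy_score(gold_detect, pred_detect))
print("type accuracy:", accuracy_score(gold_type, pred_type))
print("severity accuracy:", accuracy_score(gold_sev, pred_sev))
```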