Detection Made Easy: Potentials of Large Language Models for Solidity Vulnerabilities

Md Tauseef Alam, Raju Halder, Abyayananda Maiti
arXiv:2409.10574 · arXiv - CS - Emerging Technologies · Published 2024-09-15

Abstract

The large-scale deployment of Solidity smart contracts on the Ethereum mainnet has increasingly attracted financially motivated attackers in recent years. A few now-infamous attacks in Ethereum's history include the DAO attack in 2016 ($50 million lost), the Parity Wallet hack in 2017 ($146 million locked), Beautychain's BEC token in 2018 ($900 million in market value fell to zero), and an NFT gaming blockchain breach in 2022 ($600 million in Ether stolen). This paper presents a comprehensive investigation of large language models (LLMs) and their capabilities in detecting OWASP Top Ten vulnerabilities in Solidity. We introduce a novel, class-balanced, structured, and labeled dataset named VulSmart, which we use to benchmark and compare the performance of open-source LLMs such as CodeLlama, Llama2, CodeT5, and Falcon against closed-source models like GPT-3.5 Turbo and GPT-4o Mini. Our proposed SmartVD framework is rigorously tested against these models through extensive automated and manual evaluations, using BLEU and ROUGE metrics to assess the effectiveness of vulnerability detection in smart contracts. We also explore three distinct prompting strategies (zero-shot, few-shot, and chain-of-thought) to evaluate the multi-class classification and generative capabilities of the SmartVD framework. Our findings reveal that SmartVD outperforms its open-source counterparts and even exceeds the performance of closed-source base models like GPT-3.5 Turbo and GPT-4o Mini. After fine-tuning, the closed-source models GPT-3.5 Turbo and GPT-4o Mini achieved remarkable performance: 99% accuracy in detecting vulnerabilities, 94% in identifying their types, and 98% in determining severity. Notably, SmartVD performs best with the chain-of-thought prompting technique, whereas the fine-tuned closed-source models excel with zero-shot prompting.
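The three prompting strategies the abstract compares can be sketched as plain prompt templates. This is a minimal illustration, not the paper's actual prompts: the wording, the vulnerability class list, and the function names are all assumptions made for the example.

```python
# Illustrative templates for the three prompting strategies the paper
# evaluates (zero-shot, few-shot, chain-of-thought) applied to Solidity
# vulnerability classification. The class list below is a hypothetical
# subset, not the paper's full OWASP Top Ten label set.

VULN_CLASSES = ["reentrancy", "integer overflow", "access control", "none"]

def zero_shot(contract: str) -> str:
    """Ask for the label directly, with no demonstrations."""
    return (
        "Classify the vulnerability in this Solidity contract as one of "
        f"{VULN_CLASSES}.\n\nContract:\n{contract}\n\nAnswer:"
    )

def few_shot(contract: str, examples: list[tuple[str, str]]) -> str:
    """Prepend labeled (contract, label) demonstrations before the query."""
    shots = "\n\n".join(
        f"Contract:\n{src}\nAnswer: {label}" for src, label in examples
    )
    return f"{shots}\n\nContract:\n{contract}\nAnswer:"

def chain_of_thought(contract: str) -> str:
    """Ask the model to reason through the code before committing to a label."""
    return (
        "Analyze this Solidity contract step by step: identify external "
        "calls, state updates, and arithmetic, then name the vulnerability "
        f"class from {VULN_CLASSES}.\n\nContract:\n{contract}\n"
        "Let's think step by step."
    )
```

In a real pipeline each template's output would be sent to the model under test (e.g. GPT-3.5 Turbo or a fine-tuned checkpoint) and the returned label compared against the VulSmart ground truth.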