{"title":"Detection Made Easy: Potentials of Large Language Models for Solidity Vulnerabilities","authors":"Md Tauseef Alam, Raju Halder, Abyayananda Maiti","doi":"arxiv-2409.10574","DOIUrl":null,"url":null,"abstract":"The large-scale deployment of Solidity smart contracts on the Ethereum\nmainnet has increasingly attracted financially-motivated attackers in recent\nyears. A few now-infamous attacks in Ethereum's history includes DAO attack in\n2016 (50 million dollars lost), Parity Wallet hack in 2017 (146 million dollars\nlocked), Beautychain's token BEC in 2018 (900 million dollars market value fell\nto 0), and NFT gaming blockchain breach in 2022 ($600 million in Ether stolen).\nThis paper presents a comprehensive investigation of the use of large language\nmodels (LLMs) and their capabilities in detecting OWASP Top Ten vulnerabilities\nin Solidity. We introduce a novel, class-balanced, structured, and labeled\ndataset named VulSmart, which we use to benchmark and compare the performance\nof open-source LLMs such as CodeLlama, Llama2, CodeT5 and Falcon, alongside\nclosed-source models like GPT-3.5 Turbo and GPT-4o Mini. Our proposed SmartVD\nframework is rigorously tested against these models through extensive automated\nand manual evaluations, utilizing BLEU and ROUGE metrics to assess the\neffectiveness of vulnerability detection in smart contracts. We also explore\nthree distinct prompting strategies-zero-shot, few-shot, and\nchain-of-thought-to evaluate the multi-class classification and generative\ncapabilities of the SmartVD framework. Our findings reveal that SmartVD\noutperforms its open-source counterparts and even exceeds the performance of\nclosed-source base models like GPT-3.5 and GPT-4 Mini. After fine-tuning, the\nclosed-source models, GPT-3.5 Turbo and GPT-4o Mini, achieved remarkable\nperformance with 99% accuracy in detecting vulnerabilities, 94% in identifying\ntheir types, and 98% in determining severity. Notably, SmartVD performs best\nwith the `chain-of-thought' prompting technique, whereas the fine-tuned\nclosed-source models excel with the `zero-shot' prompting approach.","PeriodicalId":501168,"journal":{"name":"arXiv - CS - Emerging Technologies","volume":"41 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Emerging Technologies","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.10574","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
The large-scale deployment of Solidity smart contracts on the Ethereum mainnet has increasingly attracted financially motivated attackers in recent years. Now-infamous attacks in Ethereum's history include the DAO attack in 2016 ($50 million lost), the Parity Wallet hack in 2017 ($146 million locked), the collapse of Beautychain's BEC token in 2018 ($900 million in market value wiped out), and the NFT gaming blockchain breach in 2022 ($600 million in Ether stolen).
This paper presents a comprehensive investigation of the capabilities of large language models (LLMs) in detecting OWASP Top Ten vulnerabilities in Solidity. We introduce VulSmart, a novel class-balanced, structured, and labeled dataset, which we use to benchmark and compare the performance of open-source LLMs such as CodeLlama, Llama2, CodeT5, and Falcon alongside closed-source models like GPT-3.5 Turbo and GPT-4o Mini. Our proposed SmartVD framework is rigorously tested against these models through extensive automated and manual evaluations, using BLEU and ROUGE metrics to assess the effectiveness of vulnerability detection in smart contracts.
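As a rough illustration of the automated evaluation step, the following Python sketch scores a model-generated vulnerability report against a reference annotation with BLEU and ROUGE. It assumes the nltk and rouge-score packages are installed; the example strings are invented for illustration and are not the paper's actual data.

```python
# Hedged sketch: BLEU/ROUGE scoring of a generated vulnerability report.
# Assumes the nltk and rouge-score packages; example texts are illustrative.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "reentrancy vulnerability in withdraw function severity high"
candidate = "the withdraw function is vulnerable to reentrancy severity high"

# BLEU compares n-gram overlap; smoothing avoids zero scores on short texts.
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-1 counts unigram overlap; ROUGE-L uses the longest common subsequence.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.3f}")
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```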
We also explore three distinct prompting strategies (zero-shot, few-shot, and chain-of-thought) to evaluate the multi-class classification and generative capabilities of the SmartVD framework; illustrative templates for the three strategies are sketched below. Our findings reveal that SmartVD outperforms its open-source counterparts and even exceeds the performance of the closed-source base models GPT-3.5 Turbo and GPT-4o Mini.
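The three prompting strategies differ only in how the contract under test is framed for the model. A minimal sketch of what such templates might look like; the wording and the build_prompt helper are our own assumptions, not the paper's actual prompts:

```python
# Hedged sketch: the three prompting styles as plain string templates.
# Template wording and helper names are illustrative assumptions.
ZERO_SHOT = (
    "Is the following Solidity contract vulnerable? If so, name the "
    "vulnerability type and severity.\n\n{code}"
)

FEW_SHOT = (
    "Example contract:\n{example_code}\n"
    "Answer: vulnerable, type=reentrancy, severity=high\n\n"
    "Now analyze this contract:\n{code}"
)

CHAIN_OF_THOUGHT = (
    "Analyze the following Solidity contract step by step: (1) trace external "
    "calls, (2) check state updates around them, (3) then conclude whether it "
    "is vulnerable, its type, and its severity.\n\n{code}"
)

def build_prompt(strategy: str, code: str, example_code: str = "") -> str:
    """Fill the chosen template with the contract source under test."""
    templates = {
        "zero_shot": ZERO_SHOT,
        "few_shot": FEW_SHOT,           # callers should supply example_code
        "chain_of_thought": CHAIN_OF_THOUGHT,
    }
    return templates[strategy].format(code=code, example_code=example_code)
```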
After fine-tuning, the closed-source models GPT-3.5 Turbo and GPT-4o Mini achieve remarkable performance: 99% accuracy in detecting vulnerabilities, 94% in identifying vulnerability types, and 98% in determining severity. Notably, SmartVD performs best with the "chain-of-thought" prompting technique, whereas the fine-tuned closed-source models excel with the "zero-shot" approach.
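The three reported figures correspond to three separate classification accuracies. A hedged sketch of how they could be computed from gold labels and model predictions, using scikit-learn and invented example data:

```python
# Hedged sketch: the three accuracies (detection, type, severity) computed
# from gold labels and predictions. The label lists below are made up.
from sklearn.metrics import accuracy_score

gold_detect = [1, 1, 0, 1]          # 1 = vulnerable, 0 = safe
pred_detect = [1, 1, 0, 1]

gold_type = ["reentrancy", "tx-origin", "none", "overflow"]
pred_type = ["reentrancy", "tx-origin", "none", "underflow"]

gold_sev = ["high", "medium", "none", "high"]
pred_sev = ["high", "medium", "none", "high"]

print("detection accuracy:", accuracy_score(gold_detect, pred_detect))
print("type accuracy:", accuracy_score(gold_type, pred_type))
print("severity accuracy:", accuracy_score(gold_sev, pred_sev))
```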