The fire tries gold: Evaluating pre-trained language models for multi-label vulnerability detection in ethereum smart contracts

Impact Factor 4.1 · CAS Tier 2 (Computer Science) · JCR Q1, COMPUTER SCIENCE, SOFTWARE ENGINEERING
Trung Kien Luu, Doan Minh Trung, Tuan-Dung Tran, Phan The Duy, Van-Hau Pham
Journal of Systems and Software, Volume 231, Article 112642
DOI: 10.1016/j.jss.2025.112642
Published: 2025-09-22
URL: https://www.sciencedirect.com/science/article/pii/S0164121225003115
Citations: 0

Abstract

Smart contracts are integral components of blockchain ecosystems, yet they remain highly susceptible to security vulnerabilities that can lead to severe financial and operational consequences. To address this, a range of vulnerability detection techniques has been developed, including rule-based tools, neural network models, pre-trained language models (PLMs), and most recently, large language models (LLMs). However, these existing methods face three main limitations: (1) rule-based tools such as Slither and Oyente depend heavily on handcrafted heuristics, requiring human intervention and long execution times; (2) LLM-based approaches are computationally expensive and challenging to fine-tune in resource-constrained environments, particularly academic or research settings where access to high-performance computing is limited; (3) most existing approaches focus on binary and multi-class classification, assuming each contract contains only a single vulnerability, whereas in practice smart contracts often exhibit multiple coexisting vulnerabilities that call for a multi-label detection approach. In this study, we conduct a comprehensive benchmark that systematically evaluates the effectiveness of traditional deep learning models (e.g., LSTM, BiLSTM) versus state-of-the-art PLMs (e.g., CodeBERT, GraphCodeBERT) in multi-label vulnerability detection. Our dataset comprises nearly 18,000 real-world smart contracts annotated with seven distinct vulnerability types. We evaluate not only detection accuracy but also computational efficiency, including training time, inference speed, and resource consumption. Our findings reveal a crucial trade-off: while code-specialized PLMs such as GraphCodeBERT achieve a high F1-score of 96%, a well-tuned BiLSTM with an attention mechanism surpasses it (98% F1-score) with significantly less training time.
By providing a clear, evidence-based framework, this research offers practical recommendations for engineers to select the most appropriate model, balancing state-of-the-art performance with the resource constraints inherent in real-world security tools.
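To make the multi-label setting concrete (this is an illustrative sketch, not code from the paper): each contract is labeled with a binary indicator vector over the seven vulnerability types, and detection quality across all labels can be summarized with a micro-averaged F1-score, which pools true positives, false positives, and false negatives over every (contract, label) pair. The label vectors below are hypothetical.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over multi-label binary indicator vectors.

    y_true, y_pred: sequences of equal-length 0/1 rows, one row per
    contract and one column per vulnerability type.
    """
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            tp += int(t and p)            # label present and predicted
            fp += int((not t) and p)      # predicted but absent
            fn += int(t and (not p))      # present but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: 3 contracts, 7 vulnerability classes (hypothetical labels).
y_true = [[1, 0, 0, 1, 0, 0, 0],
          [0, 1, 0, 0, 0, 0, 1],
          [0, 0, 0, 0, 0, 0, 0]]
y_pred = [[1, 0, 0, 1, 0, 0, 0],
          [0, 1, 0, 0, 0, 0, 0],
          [0, 0, 1, 0, 0, 0, 0]]
print(micro_f1(y_true, y_pred))  # → 0.75 (tp=3, fp=1, fn=1)
```

The same quantity is available as `sklearn.metrics.f1_score(y_true, y_pred, average="micro")`; the pure-Python version above just makes the pooling explicit.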
Source journal: Journal of Systems and Software (Engineering & Technology — Computer Science: Theory & Methods)
CiteScore: 8.60
Self-citation rate: 5.70%
Articles per year: 193
Review time: 16 weeks
Journal description: The Journal of Systems and Software publishes papers covering all aspects of software engineering and related hardware-software-systems issues. All articles should include a validation of the idea presented, e.g. through case studies, experiments, or systematic comparisons with other approaches already in practice. Topics of interest include, but are not limited to:
• Methods and tools for, and empirical studies on, software requirements, design, architecture, verification and validation, maintenance and evolution
• Agile, model-driven, service-oriented, open source and global software development
• Approaches for mobile, multiprocessing, real-time, distributed, cloud-based, dependable and virtualized systems
• Human factors and management concerns of software development
• Data management and big data issues of software systems
• Metrics and evaluation, data mining of software development resources
• Business and economic aspects of software development processes
The journal welcomes state-of-the-art surveys and reports of practical experience for all of these topics.