Trung Kien Luu, Doan Minh Trung, Tuan-Dung Tran, Phan The Duy, Van-Hau Pham
{"title":"火试金:评估以太坊智能合约中用于多标签漏洞检测的预训练语言模型","authors":"Trung Kien Luu, Doan Minh Trung, Tuan-Dung Tran, Phan The Duy, Van-Hau Pham","doi":"10.1016/j.jss.2025.112642","DOIUrl":null,"url":null,"abstract":"<div><div>Smart contracts are integral components of blockchain ecosystems, yet they remain highly susceptible to security vulnerabilities that can lead to severe financial and operational consequences. To address this, a range of vulnerability detection techniques have been developed, including rule-based tools, neural network models, pre-trained language models (PLMs), and most recently, large language models (LLMs). However, those existing methods face three main limitations: (1) Rule-based tools such as Slither and Oyente depend heavily on handcrafted heuristics, requiring human intervention and high execution time. (2) LLM-based approaches are computationally expensive and challenging to fine-tune in resource-constrained environments, particularly within academic or research settings where access to high-performance computing is constrained. (3) Most existing approaches focus on binary and multi-class classification, assuming that each contract contains only a single vulnerability, whereas in practice, smart contracts often exhibit multiple coexisting vulnerabilities that require a multi-label detection approach. In this study, we conduct a comprehensive benchmark that systematically evaluates the effectiveness of traditional deep learning models (e.g., LSTM, BiLSTM) versus state-of-the-art PLMs (e.g., CodeBERT, GraphCodeBERT) in multi-label vulnerability detection. Our dataset comprises nearly 18,000 real-world smart contracts annotated with seven distinct vulnerability types. We evaluate not only detection accuracy but also computational efficiency, including training time, inference speed, and resource consumption. 
Our findings reveal a crucial trade-off: while code-specialized PLMs like GraphCodeBERT achieve a high F1-score of 96%, a well-tuned BiLSTM with an attention mechanism surpasses it (98% F1-score) with significantly less training time. By providing a clear, evidence-based framework, this research offers practical recommendations for engineers to select the most appropriate model, balancing state-of-the-art performance with the resource constraints inherent in real-world security tools.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"231 ","pages":"Article 112642"},"PeriodicalIF":4.1000,"publicationDate":"2025-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"The fire tries gold: Evaluating pre-trained language models for multi-label vulnerability detection in ethereum smart contracts\",\"authors\":\"Trung Kien Luu, Doan Minh Trung, Tuan-Dung Tran, Phan The Duy, Van-Hau Pham\",\"doi\":\"10.1016/j.jss.2025.112642\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Smart contracts are integral components of blockchain ecosystems, yet they remain highly susceptible to security vulnerabilities that can lead to severe financial and operational consequences. To address this, a range of vulnerability detection techniques have been developed, including rule-based tools, neural network models, pre-trained language models (PLMs), and most recently, large language models (LLMs). However, those existing methods face three main limitations: (1) Rule-based tools such as Slither and Oyente depend heavily on handcrafted heuristics, requiring human intervention and high execution time. (2) LLM-based approaches are computationally expensive and challenging to fine-tune in resource-constrained environments, particularly within academic or research settings where access to high-performance computing is constrained. 
(3) Most existing approaches focus on binary and multi-class classification, assuming that each contract contains only a single vulnerability, whereas in practice, smart contracts often exhibit multiple coexisting vulnerabilities that require a multi-label detection approach. In this study, we conduct a comprehensive benchmark that systematically evaluates the effectiveness of traditional deep learning models (e.g., LSTM, BiLSTM) versus state-of-the-art PLMs (e.g., CodeBERT, GraphCodeBERT) in multi-label vulnerability detection. Our dataset comprises nearly 18,000 real-world smart contracts annotated with seven distinct vulnerability types. We evaluate not only detection accuracy but also computational efficiency, including training time, inference speed, and resource consumption. Our findings reveal a crucial trade-off: while code-specialized PLMs like GraphCodeBERT achieve a high F1-score of 96%, a well-tuned BiLSTM with an attention mechanism surpasses it (98% F1-score) with significantly less training time. 
By providing a clear, evidence-based framework, this research offers practical recommendations for engineers to select the most appropriate model, balancing state-of-the-art performance with the resource constraints inherent in real-world security tools.</div></div>\",\"PeriodicalId\":51099,\"journal\":{\"name\":\"Journal of Systems and Software\",\"volume\":\"231 \",\"pages\":\"Article 112642\"},\"PeriodicalIF\":4.1000,\"publicationDate\":\"2025-09-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Systems and Software\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0164121225003115\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, SOFTWARE ENGINEERING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Systems and Software","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0164121225003115","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
The fire tries gold: Evaluating pre-trained language models for multi-label vulnerability detection in Ethereum smart contracts
Smart contracts are integral components of blockchain ecosystems, yet they remain highly susceptible to security vulnerabilities that can lead to severe financial and operational consequences. To address this, a range of vulnerability detection techniques have been developed, including rule-based tools, neural network models, pre-trained language models (PLMs), and most recently, large language models (LLMs). However, these existing methods face three main limitations: (1) Rule-based tools such as Slither and Oyente depend heavily on handcrafted heuristics, require human intervention, and incur long execution times. (2) LLM-based approaches are computationally expensive and challenging to fine-tune in resource-constrained environments, particularly within academic or research settings where access to high-performance computing is limited. (3) Most existing approaches focus on binary and multi-class classification, assuming that each contract contains only a single vulnerability, whereas in practice, smart contracts often exhibit multiple coexisting vulnerabilities that require a multi-label detection approach. In this study, we conduct a comprehensive benchmark that systematically evaluates the effectiveness of traditional deep learning models (e.g., LSTM, BiLSTM) versus state-of-the-art PLMs (e.g., CodeBERT, GraphCodeBERT) in multi-label vulnerability detection. Our dataset comprises nearly 18,000 real-world smart contracts annotated with seven distinct vulnerability types. We evaluate not only detection accuracy but also computational efficiency, including training time, inference speed, and resource consumption. Our findings reveal a crucial trade-off: while code-specialized PLMs like GraphCodeBERT achieve a high F1-score of 96%, a well-tuned BiLSTM with an attention mechanism surpasses it (98% F1-score) with significantly less training time.
By providing a clear, evidence-based framework, this research offers practical recommendations for engineers to select the most appropriate model, balancing state-of-the-art performance with the resource constraints inherent in real-world security tools.
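The multi-label setting described in the abstract differs from multi-class classification in that each contract may carry several vulnerability labels at once: a detector emits one independent score per vulnerability type, thresholds each score separately, and is evaluated with a metric such as micro-averaged F1 that pools counts across all labels. A minimal sketch in plain Python (the label names and the 0.5 threshold are illustrative assumptions, not the paper's implementation):

```python
# Multi-label prediction: one independent probability per vulnerability type,
# thresholded separately, then scored with micro-averaged F1 over all labels.
# The seven label names below are hypothetical, chosen only for illustration.
VULN_TYPES = ["reentrancy", "timestamp", "tx_origin", "overflow",
              "delegatecall", "unchecked_call", "frozen_ether"]

def predict_labels(scores, threshold=0.5):
    """Turn per-label sigmoid scores into a set of predicted labels."""
    return {label for label, s in zip(VULN_TYPES, scores) if s >= threshold}

def micro_f1(pred_sets, true_sets):
    """Micro-averaged F1: pool true/false positives and false negatives
    across every label and every sample before computing the score."""
    tp = fp = fn = 0
    for pred, true in zip(pred_sets, true_sets):
        tp += len(pred & true)   # labels correctly predicted
        fp += len(pred - true)   # labels predicted but not present
        fn += len(true - pred)   # labels present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Example: one contract flagged with two coexisting vulnerabilities,
# which a single-label multi-class classifier could not express.
scores = [0.91, 0.10, 0.05, 0.77, 0.20, 0.30, 0.02]
print(sorted(predict_labels(scores)))  # ['overflow', 'reentrancy']
```

Because precision and recall are pooled over all label decisions, micro-F1 rewards a model that finds every coexisting vulnerability in a contract rather than only the most prominent one, which is the evaluation regime the benchmark above targets.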
About the journal:
The Journal of Systems and Software publishes papers covering all aspects of software engineering and related hardware-software-systems issues. All articles should include a validation of the idea presented, e.g. through case studies, experiments, or systematic comparisons with other approaches already in practice. Topics of interest include, but are not limited to:
•Methods and tools for, and empirical studies on, software requirements, design, architecture, verification and validation, maintenance and evolution
•Agile, model-driven, service-oriented, open source and global software development
•Approaches for mobile, multiprocessing, real-time, distributed, cloud-based, dependable and virtualized systems
•Human factors and management concerns of software development
•Data management and big data issues of software systems
•Metrics and evaluation, data mining of software development resources
•Business and economic aspects of software development processes
The journal welcomes state-of-the-art surveys and reports of practical experience for all of these topics.