SCsVulSegLytix: Detecting and extracting vulnerable segments from smart contracts using weakly-supervised learning
Authors: Borna Ahmadzadeh, Arousha Haghighian Roudsari, Sepideh HajiHosseinKhani, Arash Habibi Lashkari
Journal: Journal of Systems and Software, volume 230, Article 112532 (impact factor 4.1; JCR Q1, Computer Science, Software Engineering)
Publication date: 2025-07-17
DOI: 10.1016/j.jss.2025.112532
URL: https://www.sciencedirect.com/science/article/pii/S0164121225002006
Citation count: 0
Abstract
Smart contracts (SCs), self-executing digital contracts deployed on blockchain networks, are becoming increasingly prevalent in sectors such as finance, thanks to their automation, transparency, and cost efficiency. Given the substantial value of the assets they manage, SCs have become attractive targets for hackers, who exploit their vulnerabilities to steal funds. Blockchain’s inherent immutability means vulnerabilities cannot be fixed quickly, and the immaturity of the Solidity programming language, which introduces potential security threats to SCs, exacerbates this problem. As such, there is a pressing need to develop security measures that identify vulnerabilities in SCs. Non-learning-based detection methods, which rely on heuristics designed by experts, often cannot handle the evolving complexity of SC vulnerabilities. Learning-based solutions, in contrast, typically outperform non-learning-based ones but generally do not pinpoint the locations of vulnerabilities in SCs. The learning-based approaches that do identify vulnerability locations come with several challenges. First, they convert SCs into graphs, incurring computational overhead and making the learning system more complex. Second, most require line- or function-level labels for training, which are difficult to gather. Lastly, their coverage of vulnerability types is not extensive, leaving users exposed to the vulnerability types they do not cover. This work presents SCsVulSegLytix, a learning-based approach for detecting and extracting vulnerable segments in SCs. SCsVulSegLytix uses a source code-based Transformer model trained with contract-level labels to classify entire contracts as vulnerable, followed by a post-hoc interpretability method that extracts vulnerable segments in SCs according to relevance scores. Unlike previous extraction models, SCsVulSegLytix requires no line-level annotations and can be trained using contract-wide labels only, which are much easier to collect. Moreover, it operates directly on Solidity source code, substantially improving efficiency compared to expensive graph-based models. Finally, it extends support to several important classes of SC vulnerabilities, helping protect developers against a variety of potential attacks. Experiments show that our model outperforms existing models in both contract- and line-level vulnerability identification while achieving greater computational efficiency.
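The extraction step described above (ranking parts of a contract by relevance scores) can be illustrated with a minimal Python sketch. The abstract does not specify which interpretability method is used or how scores are aggregated, so everything here is an assumption made for illustration: the function name `rank_lines_by_relevance`, the choice to sum only positive token-level relevance per source line, the `top_k` cutoff, and the toy scores.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def rank_lines_by_relevance(
    token_lines: Sequence[int],
    token_scores: Sequence[float],
    top_k: int = 2,
) -> List[Tuple[int, float]]:
    """Aggregate token-level relevance into line-level scores.

    token_lines[i] is the source line of token i and token_scores[i] is its
    relevance for the "vulnerable" class, as produced by some post-hoc
    interpretability method (hypothetical input, not the paper's actual output).
    Returns up to top_k (line, score) pairs, highest score first.
    """
    if len(token_lines) != len(token_scores):
        raise ValueError("token_lines and token_scores must have the same length")

    line_scores: Dict[int, float] = defaultdict(float)
    for line_no, score in zip(token_lines, token_scores):
        # Assumption: only positive relevance counts as evidence of vulnerability.
        line_scores[line_no] += max(score, 0.0)

    # Sort by descending score; break ties by ascending line number.
    ranked = sorted(line_scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


# Toy example: eight tokens spread over four source lines (made-up scores).
token_lines = [1, 1, 2, 2, 2, 3, 4, 4]
token_scores = [0.01, 0.02, 0.40, 0.35, -0.10, 0.05, 0.20, 0.10]

flagged = rank_lines_by_relevance(token_lines, token_scores, top_k=2)
print([line for line, _ in flagged])  # prints [2, 4]
```

Because the scores come from a classifier trained only on contract-level labels, a ranking of this kind needs no line-level annotations, which is the weak-supervision property the abstract emphasizes.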
About the journal:
The Journal of Systems and Software publishes papers covering all aspects of software engineering and related hardware-software-systems issues. All articles should include a validation of the idea presented, e.g. through case studies, experiments, or systematic comparisons with other approaches already in practice. Topics of interest include, but are not limited to:
• Methods and tools for, and empirical studies on, software requirements, design, architecture, verification and validation, maintenance and evolution
• Agile, model-driven, service-oriented, open source and global software development
• Approaches for mobile, multiprocessing, real-time, distributed, cloud-based, dependable and virtualized systems
• Human factors and management concerns of software development
• Data management and big data issues of software systems
• Metrics and evaluation, data mining of software development resources
• Business and economic aspects of software development processes
The journal welcomes state-of-the-art surveys and reports of practical experience for all of these topics.