Van Tong, Cuong Dao, Hai-Anh Tran, Truong X. Tran, Sami Souihi
{"title":"Enhancing BERT-Based Language Model for Multi-label Vulnerability Detection of Smart Contract in Blockchain","authors":"Van Tong, Cuong Dao, Hai-Anh Tran, Truong X. Tran, Sami Souihi","doi":"10.1007/s10922-024-09832-w","DOIUrl":null,"url":null,"abstract":"<p>Smart contracts are decentralized applications that hold a pivotal role in blockchain-based systems. Smart contracts are composed of error-prone programming languages, so it is affected by many vulnerabilities (e.g., time dependence, outdated version, etc.), which can result in a substantial economic loss within the blockchain ecosystem. Therefore, many vulnerability detection tools are designed to detect the vulnerabilities in smart contracts such as Slither, Mythrill and so forth. However, these tools require high processing time and cannot achieve good accuracy with complex smart contracts nowadays. Consequently, many studies have shifted towards using Deep Learning (DL) techniques, which consider bytecode to determine vulnerabilities in smart contracts. However, these mechanisms reveal three main limitations. First, these mechanisms focus on multi-class problems, assuming that a given smart contract contains only a single vulnerability while the smart contract can contain more than one vulnerability. Second, these approaches encounter ineffective word embedding with large input sequences. Third, the learning model in these mechanisms is forced to classify into one of pre-defined labels even when it cannot make decisions accurately, leading to misclassifications. Therefore, in this paper, we propose a multi-label vulnerability classification mechanism using a language model. To deal with the ineffective word embedding, the proposed mechanism not only takes into account the implicit features derived from the language models (e.g., SecBERT, etc.) but also auxiliary features extracted from other word embedding techniques (e.g., TF-IDF, etc.). Besides, a trustworthy neural network model is proposed to reduce the misclassification rate of vulnerability classification. In detail, an additional neuron is added to the output of the model to indicate whether the model is able to make decisions accurately or not. The experimental results illustrate that the trustworthy model outperforms benchmarks (e.g., binary relevance, label powerset, classifier chain, etc.), achieving up to approximately 98% f1-score while requiring low execution time with 26 ms.</p>","PeriodicalId":50119,"journal":{"name":"Journal of Network and Systems Management","volume":"62 1","pages":""},"PeriodicalIF":4.1000,"publicationDate":"2024-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Network and Systems Management","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s10922-024-09832-w","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0
Abstract
Smart contracts are decentralized applications that hold a pivotal role in blockchain-based systems. Smart contracts are composed of error-prone programming languages, so it is affected by many vulnerabilities (e.g., time dependence, outdated version, etc.), which can result in a substantial economic loss within the blockchain ecosystem. Therefore, many vulnerability detection tools are designed to detect the vulnerabilities in smart contracts such as Slither, Mythrill and so forth. However, these tools require high processing time and cannot achieve good accuracy with complex smart contracts nowadays. Consequently, many studies have shifted towards using Deep Learning (DL) techniques, which consider bytecode to determine vulnerabilities in smart contracts. However, these mechanisms reveal three main limitations. First, these mechanisms focus on multi-class problems, assuming that a given smart contract contains only a single vulnerability while the smart contract can contain more than one vulnerability. Second, these approaches encounter ineffective word embedding with large input sequences. Third, the learning model in these mechanisms is forced to classify into one of pre-defined labels even when it cannot make decisions accurately, leading to misclassifications. Therefore, in this paper, we propose a multi-label vulnerability classification mechanism using a language model. To deal with the ineffective word embedding, the proposed mechanism not only takes into account the implicit features derived from the language models (e.g., SecBERT, etc.) but also auxiliary features extracted from other word embedding techniques (e.g., TF-IDF, etc.). Besides, a trustworthy neural network model is proposed to reduce the misclassification rate of vulnerability classification. In detail, an additional neuron is added to the output of the model to indicate whether the model is able to make decisions accurately or not. The experimental results illustrate that the trustworthy model outperforms benchmarks (e.g., binary relevance, label powerset, classifier chain, etc.), achieving up to approximately 98% f1-score while requiring low execution time with 26 ms.
期刊介绍:
Journal of Network and Systems Management, features peer-reviewed original research, as well as case studies in the fields of network and system management. The journal regularly disseminates significant new information on both the telecommunications and computing aspects of these fields, as well as their evolution and emerging integration. This outstanding quarterly covers architecture, analysis, design, software, standards, and migration issues related to the operation, management, and control of distributed systems and communication networks for voice, data, video, and networked computing.