SolBERT: Advancing solidity smart contract similarity analysis via self-supervised pre-training and contrastive fine-tuning
Zhenzhou Tian, Yudong Teng, Xianqun Ke, Yanping Chen, Lingwei Chen
Information and Software Technology, Volume 184, Article 107766. Published 2025-05-13. DOI: 10.1016/j.infsof.2025.107766
https://www.sciencedirect.com/science/article/pii/S0950584925001053
Abstract
Context:
Reliable and effective similarity analysis of smart contracts facilitates the maintenance and quality assurance of the smart contract ecosystem. However, existing signature-based and code representation learning-based methods suffer from limitations such as heavyweight program analysis overhead or suboptimal contract encodings.
Objective:
This paper aims to design a fully unsupervised language model that better captures the syntactic and semantic richness of Solidity code, and to use it to improve the effectiveness of smart contract similarity analysis.
Methods:
Inspired by the strong semantic learning capability of pre-trained language models (PLMs), we propose SolBERT, a PLM tailored to Solidity smart contract similarity detection. To produce high-quality encodings, SolBERT first learns a base model through BERT-style pre-training with masked language modeling (MLM) and token type prediction (TTP) tasks. These tasks are applied to code-structure-aware token sequences derived from the contracts’ abstract syntax trees (ASTs) through structure-retaining tree linearization and lightweight normalization. On this basis, self-supervised contrastive fine-tuning and unsupervised whitening are then performed to further optimize the contract encodings.
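The whitening step named above can be illustrated with a minimal sketch. The abstract does not specify SolBERT's exact procedure, so this assumes the common formulation in which embeddings are centered and then linearly transformed so that their covariance becomes the identity matrix. The function names `fit_whitening` and `apply_whitening`, the `eps` parameter, and the random stand-in embeddings are hypothetical and not taken from the paper.

```python
import numpy as np


def fit_whitening(embeddings: np.ndarray, eps: float = 1e-9):
    """Estimate a whitening transform from an (n, d) embedding matrix.

    Returns the mean row vector and a (d, d) kernel such that
    (x - mean) @ kernel has approximately identity covariance.
    """
    mean = embeddings.mean(axis=0, keepdims=True)
    centered = embeddings - mean
    cov = centered.T @ centered / len(embeddings)
    # cov is symmetric positive semi-definite, so its SVD is U diag(s) U^T.
    u, s, _ = np.linalg.svd(cov)
    # eps (an assumption) guards against division by zero for flat directions.
    kernel = u @ np.diag(1.0 / np.sqrt(s + eps))
    return mean, kernel


def apply_whitening(embeddings: np.ndarray, mean: np.ndarray, kernel: np.ndarray):
    """Center the embeddings and project them with the whitening kernel."""
    return (embeddings - mean) @ kernel


# Stand-in data: correlated 4-dimensional vectors playing the role of
# contract encodings (purely synthetic, for illustration only).
rng = np.random.default_rng(0)
mixing = np.array([
    [2.0, 0.5, 0.0, 0.0],
    [0.0, 1.0, 0.3, 0.0],
    [0.0, 0.0, 0.5, 0.1],
    [0.0, 0.0, 0.0, 1.5],
])
raw = rng.normal(size=(500, 4)) @ mixing + 3.0

mean, kernel = fit_whitening(raw)
white = apply_whitening(raw, mean, kernel)
print(np.round(white.T @ white / len(white), 3))  # covariance after whitening
```

Because the transform is estimated from unlabeled embeddings alone, it fits the unsupervised setting the abstract describes. Whether SolBERT also reduces dimensionality during whitening is not stated there.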
Results:
Experiments are conducted on three contract similarity-related tasks: contract clone detection, bug detection, and code clustering. The results indicate that SolBERT significantly outperforms state-of-the-art approaches, with average absolute gains of 21.33% and 21.50% in F1 and 17.78% and 26.60% in accuracy for the clone detection and bug detection tasks, respectively, and an average absolute gain of 17.97% for the code clustering task. SolBERT with both contrastive fine-tuning and whitening also outperforms the variants that lack either optimization.
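To make the reported metrics concrete, the sketch below shows one plausible way an encoder's output could be scored on clone detection: compare two contract encodings by cosine similarity, label the pair a clone when the similarity reaches a threshold, and compute F1 over the labeled pairs. The abstract does not describe the actual decision procedure, so the thresholding scheme, the threshold value, and the toy vectors are all assumptions. The helper `absolute_gain` reads "absolute gain" as a difference in percentage points, which is the usual meaning of the term.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict_clones(pairs, threshold):
    """Label each (u, v) pair as a clone when its similarity reaches the threshold."""
    return [cosine(u, v) >= threshold for u, v in pairs]


def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall over boolean labels."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def absolute_gain(new_score, baseline_score):
    """Gain in percentage points between two scores given as fractions in [0, 1]."""
    return (new_score - baseline_score) * 100


# Toy encodings (hypothetical): two near-duplicate pairs and two unrelated pairs.
pairs = [
    ([1.0, 0.0, 0.2], [0.9, 0.1, 0.2]),
    ([0.0, 1.0, 0.0], [0.1, 0.9, 0.1]),
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),
    ([0.0, 0.0, 1.0], [1.0, 0.0, 0.0]),
]
labels = [True, True, False, False]

predictions = predict_clones(pairs, threshold=0.8)  # threshold is an assumption
print(predictions, f1_score(labels, predictions))
```

An absolute gain differs from a relative one: moving from an F1 of 0.70 to 0.90 is an absolute gain of 20 percentage points, but a relative improvement of about 28.6%.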
Conclusion:
The proposed approach, SolBERT, can serve as a reliable and powerful smart contract encoder that better captures the syntactic and semantic aspects of Solidity code. The results also validate the effectiveness and the positive synergistic effect of SolBERT’s encoding optimization operations.
About the journal:
Information and Software Technology is the international archival journal focusing on research and experience that contributes to the improvement of software development practices. The journal's scope includes methods and techniques to better engineer software and manage its development. Articles submitted for review should have a clear component of software engineering or address ways to improve the engineering and management of software development. Areas covered by the journal include:
• Software management, quality and metrics
• Software processes
• Software architecture, modelling, specification, design and programming
• Functional and non-functional software requirements
• Software testing and verification & validation
• Empirical studies of all aspects of engineering and managing software development
Short Communications is a new section dedicated to short papers addressing new ideas, controversial opinions, "negative" results and much more. Read the Guide for Authors for more information.
The journal encourages and welcomes submissions of systematic literature studies (reviews and maps) within the scope of the journal. Information and Software Technology is the premier outlet for systematic literature studies in software engineering.