SolBERT: Advancing solidity smart contract similarity analysis via self-supervised pre-training and contrastive fine-tuning

IF 4.3 | Zone 2, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Information and Software Technology Pub Date : 2025-05-13 DOI:10.1016/j.infsof.2025.107766

Zhenzhou Tian , Yudong Teng , Xianqun Ke , Yanping Chen , Lingwei Chen

Information and Software Technology, Volume 184, Article 107766
https://www.sciencedirect.com/science/article/pii/S0950584925001053

Citations: 0

Abstract

Context:

Reliable and effective similarity analysis of smart contracts facilitates the maintenance and quality assurance of the smart contract ecosystem. However, existing signature-based methods and code-representation-learning-based methods suffer from limitations such as heavyweight program analysis overhead or suboptimal contract encodings.

Objective:

This paper aims to design a fully unsupervised language model that better captures the syntactic and semantic richness of Solidity code, and to use it to improve the effectiveness of smart contract similarity analysis.

Methods:

Inspired by the semantic learning capability of pre-trained language models (PLMs), we propose SolBERT, a PLM specifically tailored for Solidity smart contract similarity detection. To produce high-quality encodings, SolBERT first learns a base model through BERT-style pre-training with two tasks, masked language modeling (MLM) and token type prediction (TTP). Both tasks are applied to code-structure-aware token sequences derived from the contracts' abstract syntax trees (ASTs) through structure-retaining tree linearization and lightweight normalization. On this basis, self-supervised contrastive fine-tuning and unsupervised whitening are then applied to optimize the generated contract encodings.
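The idea of structure-retaining linearization with lightweight normalization can be illustrated with a minimal sketch. The abstract does not specify the actual scheme, so everything below is an assumption made for illustration: the toy tuple-based AST, the node type names (`Assignment`, `Identifier`, `NumberLiteral`), the bracket-style open/close markers, and the `ID<n>`/`NUM` placeholders are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Optional, Tuple

# Toy AST node (hypothetical format): (node_type, children).
# A child is either another node or a plain string holding a leaf token.
Node = Tuple[str, list]


def normalize_leaf(parent_type: str, text: str, names: Dict[str, str]) -> str:
    """Lightweight normalization: abstract away concrete names and numbers."""
    if parent_type == "Identifier":
        # The same identifier always maps to the same placeholder.
        if text not in names:
            names[text] = f"ID{len(names)}"
        return names[text]
    if parent_type == "NumberLiteral":
        return "NUM"
    return text  # operators and keywords are kept as-is


def linearize(node: Node,
              names: Optional[Dict[str, str]] = None,
              out: Optional[List[str]] = None) -> List[str]:
    """Pre-order traversal that wraps every subtree in open/close markers,
    so the tree shape can be recovered from the flat token sequence."""
    if names is None:
        names = {}
    if out is None:
        out = []
    node_type, children = node
    out.append(f"<{node_type}>")
    for child in children:
        if isinstance(child, str):
            out.append(normalize_leaf(node_type, child, names))
        else:
            linearize(child, names, out)
    out.append(f"</{node_type}>")
    return out


def assignment(target: str, operand: str, literal: str) -> Node:
    """Hand-built toy AST for the statement `target = operand + literal`."""
    return ("Assignment", [
        ("Identifier", [target]),
        "=",
        ("BinaryOperation", [
            ("Identifier", [operand]), "+", ("NumberLiteral", [literal]),
        ]),
    ])


if __name__ == "__main__":
    a = linearize(assignment("balance", "balance", "1"))
    b = linearize(assignment("total", "total", "5"))
    print(" ".join(a))
    print("same normalized sequence:", a == b)
```

In this sketch, two statements that differ only in identifier names and literal values produce the same token sequence, which is the kind of invariance a similarity-oriented encoder would plausibly want from its input.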

Results:

Experiments are conducted on three contract similarity-related tasks: contract clone detection, bug detection, and code clustering. The results indicate that SolBERT significantly outperforms state-of-the-art approaches, with average absolute gains of 21.33% and 21.50% in F1 and of 17.78% and 26.60% in accuracy for the clone detection and bug detection tasks, respectively, and an average absolute gain of 17.97% for the code clustering task. With both contrastive fine-tuning and whitening applied, SolBERT also outperforms variants that lack either optimization.
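To make the F1 and accuracy figures concrete, the sketch below shows one common way an encoder can be evaluated on clone detection: a pair of contracts is predicted to be a clone when the cosine similarity of their encodings reaches a threshold. The abstract does not state the decision rule used in the paper, so the threshold, the function names, and the two-dimensional toy vectors are all assumptions for illustration and are not the paper's data or protocol.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two encodings (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict_clones(pairs: List[Tuple[Vector, Vector]],
                   threshold: float) -> List[bool]:
    """Predict 'clone' when the similarity reaches the (assumed) threshold."""
    return [cosine(u, v) >= threshold for u, v in pairs]


def f1_and_accuracy(pred: List[bool], gold: List[bool]) -> Tuple[float, float]:
    """Binary F1 (positive class = clone) and accuracy."""
    tp = sum(1 for p, g in zip(pred, gold) if p and g)
    fp = sum(1 for p, g in zip(pred, gold) if p and not g)
    fn = sum(1 for p, g in zip(pred, gold) if not p and g)
    tn = sum(1 for p, g in zip(pred, gold) if not p and not g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(gold) if gold else 0.0
    return f1, accuracy


# Toy encodings standing in for contract embeddings (made-up values).
TOY_PAIRS = [
    ([1.0, 0.0], [1.0, 0.1]),   # nearly identical direction
    ([1.0, 0.0], [0.0, 1.0]),   # orthogonal
    ([1.0, 1.0], [1.0, 0.9]),   # nearly identical direction
    ([1.0, 0.0], [0.5, 1.0]),   # moderately similar
]
TOY_GOLD = [True, False, True, True]

if __name__ == "__main__":
    predictions = predict_clones(TOY_PAIRS, threshold=0.8)
    f1, acc = f1_and_accuracy(predictions, TOY_GOLD)
    print(f"F1={f1:.2f} accuracy={acc:.2f}")
```

An "absolute gain" in the sense used above is then the plain difference between two such scores, expressed in percentage points, rather than a relative improvement.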

Conclusion:

The proposed approach, SolBERT, can serve as a reliable and powerful smart contract encoder that better captures the syntactic and semantic aspects of Solidity code. The results also validate the effectiveness and positive synergistic effect of SolBERT's encoding optimization operations.


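Source journal

Information and Software Technology (Engineering & Technology, Computer Science: Software Engineering)

CiteScore: 9.10

Self-citation rate: 7.70%

Articles published: 164

Review time: 9.6 weeks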

About the journal: Information and Software Technology is the international archival journal focusing on research and experience that contributes to the improvement of software development practices. The journal's scope includes methods and techniques to better engineer software and manage its development. Articles submitted for review should have a clear component of software engineering or address ways to improve the engineering and management of software development. Areas covered by the journal include:

• Software management, quality and metrics
• Software processes
• Software architecture, modelling, specification, design and programming
• Functional and non-functional software requirements
• Software testing and verification & validation
• Empirical studies of all aspects of engineering and managing software development

Short Communications is a new section dedicated to short papers addressing new ideas, controversial opinions, "negative" results and much more. Read the Guide for authors for more information. The journal encourages and welcomes submissions of systematic literature studies (reviews and maps) within the scope of the journal. Information and Software Technology is the premier outlet for systematic literature studies in software engineering.