Fine-Tuning Pre-Trained CodeBERT for Code Search in Smart Contract

Pub Date: 2023-06-01, DOI: 10.1051/wujns/2023283237

Huan Jin, Qinying Li


Citations: 0

Abstract

Smart contracts, which execute automatically on decentralized platforms such as Ethereum, require high security and low gas consumption. Developers therefore have a strong demand for semantic code search tools that accept natural language queries and efficiently retrieve existing code snippets. However, existing code search models face a semantic gap between code and queries, a gap that typically takes a large amount of training data to close. In this paper, we propose a fine-tuning approach to bridge the semantic gap in code search and improve search accuracy. We collect 80 723 distinct pairs from Etherscan.io and use them to fine-tune, validate, and test the pre-trained CodeBERT model. Using the fine-tuned model, we develop a code search engine specifically for smart contracts. We evaluate the Recall@k and Mean Reciprocal Rank (MRR) of the fine-tuned CodeBERT model using different proportions of the fine-tuning data. Encouragingly, even a small amount of fine-tuning data produces satisfactory results. In addition, we compare the fine-tuned CodeBERT model with two state-of-the-art models. The experimental results show that the fine-tuned CodeBERT model achieves superior Recall@k and MRR. These findings highlight the effectiveness of our fine-tuning approach and its potential to significantly improve code search accuracy.
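The two evaluation metrics named in the abstract can be illustrated with a minimal sketch. It uses the standard definitions of Recall@k and MRR and assumes that each query has exactly one correct snippet, whose 1-based position in the ranked results is already known. The function names and the sample ranks are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def recall_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of queries whose correct snippet is ranked within the top k."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank of the correct snippet over all queries."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1.0 / r for r in ranks) / len(ranks)


# Hypothetical 1-based ranks of the correct snippet for five queries.
ranks = [1, 3, 2, 10, 1]

print("Recall@1:", recall_at_k(ranks, 1))
print("Recall@5:", recall_at_k(ranks, 5))
print("MRR:", mean_reciprocal_rank(ranks))
```

Recall@k only records whether the correct snippet appears in the top k, while MRR also rewards placing it nearer the top, which is why the two are commonly reported together.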
