Reasoned or Rapid code? Unveiling the strengths and limits of DeepSeek for Solidity development

IF 4.3 · JCR Q2, Computer Science, Information Systems · CAS Tier 2, Computer Science
Gavina Baralla, Giacomo Ibba, Roberto Tonelli
DOI: 10.1016/j.infsof.2025.107917
Journal: Information and Software Technology, Volume 189, Article 107917
Published: 2025-10-10 (Journal Article)
Full text: https://www.sciencedirect.com/science/article/pii/S0950584925002563
Citations: 0

Reasoned or Rapid code? Unveiling the strengths and limits of DeepSeek for Solidity development

Context:

As blockchain systems grow in complexity, secure and efficient smart contract development remains a crucial challenge. Large Language Models (LLMs) like DeepSeek promise significant enhancements in developer productivity through automated code generation, debugging, and testing. This study focuses on Solidity, the dominant language for Ethereum smart contracts, where correctness, gas efficiency, and security are critical to real-world adoption.
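To make the Context concrete: the correctness and gas-efficiency concerns the study measures often come down to small implementation choices. The sketch below (illustrative only, not taken from the paper's generated contracts) contrasts a naive batch payout that re-reads a storage slot on every loop iteration with a variant that caches the balance in a local variable, a standard Solidity gas optimisation.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

// Hypothetical sketch of the kind of gas-efficiency concern the study
// evaluates; the contract and function names are illustrative.
contract Airdrop {
    mapping(address => uint256) public balances;

    // Costlier: reads and writes balances[msg.sender] in storage on
    // every iteration of the loop.
    function payOutNaive(address[] calldata to, uint256 amount) external {
        for (uint256 i = 0; i < to.length; i++) {
            require(balances[msg.sender] >= amount, "insufficient");
            balances[msg.sender] -= amount;
            balances[to[i]] += amount;
        }
    }

    // Cheaper: cache the sender's balance in memory, write back once.
    function payOut(address[] calldata to, uint256 amount) external {
        uint256 bal = balances[msg.sender];
        for (uint256 i = 0; i < to.length; i++) {
            require(bal >= amount, "insufficient");
            bal -= amount;
            balances[to[i]] += amount;
        }
        balances[msg.sender] = bal;
    }
}
```

Both functions are behaviourally equivalent for well-formed inputs; the difference is purely in storage access cost, which is one dimension on which generated contracts can be compared.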

Objective:

This study evaluates the capabilities of DeepSeek’s V3 and R1 models, a non-reasoning Mixture-of-Experts architecture and a reasoning-based model trained via reinforcement learning, respectively, in automating Solidity contract generation and testing, as well as identifying and fixing common vulnerabilities.

Methods:

We designed a controlled experimental framework to evaluate both models by generating and analysing a diverse set of smart contracts, including standardised tokens (ERC20, ERC721, ERC1155) and real-world application scenarios (Supply Chain, Token Exchange, Auction). The evaluation is grounded in a multidimensional metric suite covering quality, technical robustness, and process characteristics. Vulnerability detection and patching capabilities are tested using predefined vulnerable contracts and guided patch prompts. The analysis spans six levels of prompt complexity and compares the impact of reasoning-based and non-reasoning-based generation strategies.
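As an illustration of what a "predefined vulnerable contract" and its guided patch might look like, the sketch below shows the classic reentrancy flaw and its checks-effects-interactions fix. The paper's actual test contracts are not reproduced here; the contract names and structure are assumptions for the example.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

// BUG: the external call happens before the balance is zeroed, so a
// malicious receiver's fallback can re-enter withdraw() and drain funds.
contract VulnerableVault {
    mapping(address => uint256) public balances;

    function deposit() external payable {
        balances[msg.sender] += msg.value;
    }

    function withdraw() external {
        uint256 amount = balances[msg.sender];
        (bool ok, ) = msg.sender.call{value: amount}("");
        require(ok, "transfer failed");
        balances[msg.sender] = 0; // too late: state updated after the call
    }
}

// Patched: checks-effects-interactions — update state before interacting.
contract PatchedVault {
    mapping(address => uint256) public balances;

    function deposit() external payable {
        balances[msg.sender] += msg.value;
    }

    function withdraw() external {
        uint256 amount = balances[msg.sender];
        balances[msg.sender] = 0; // effect first
        (bool ok, ) = msg.sender.call{value: amount}(""); // interaction last
        require(ok, "transfer failed");
    }
}
```

A guided patch prompt in this setting would point the model at the vulnerable ordering and ask it to produce the second variant (or an equivalent mutex-based guard).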

Results:

Findings reveal that R1 delivers more accurate and optimised outputs under high complexity, while V3 performs more consistently on simpler tasks with less complex code structures. However, both models exhibit persistent hallucinations, limitations in vulnerability coverage, and inconsistencies due to prompt formulation. The correlation between re-evaluation patterns and output quality suggests that reasoning helps in complex scenarios, although excessive revisions may lead to over-engineered or unstable solutions.

Conclusions:

Neither model is robust enough to autonomously generate issue-free smart contracts in complex or security-critical scenarios, underscoring the need for human oversight. These findings highlight best practices for integrating LLMs into blockchain development workflows and emphasise the importance of aligning model selection with task complexity and security requirements.
Source journal: Information and Software Technology (Engineering & Technology – Computer Science: Software Engineering)
CiteScore: 9.10
Self-citation rate: 7.70%
Annual articles: 164
Review time: 9.6 weeks
About the journal: Information and Software Technology is the international archival journal focusing on research and experience that contributes to the improvement of software development practices. The journal's scope includes methods and techniques to better engineer software and manage its development. Articles submitted for review should have a clear component of software engineering or address ways to improve the engineering and management of software development. Areas covered by the journal include: software management, quality and metrics; software processes; software architecture, modelling, specification, design and programming; functional and non-functional software requirements; software testing and verification & validation; and empirical studies of all aspects of engineering and managing software development. Short Communications is a new section dedicated to short papers addressing new ideas, controversial opinions, "negative" results and much more; read the Guide for Authors for more information. The journal encourages and welcomes submissions of systematic literature studies (reviews and maps) within its scope, and is the premiere outlet for systematic literature studies in software engineering.