Reasoned or Rapid code? Unveiling the strengths and limits of DeepSeek for Solidity development
Gavina Baralla, Giacomo Ibba, Roberto Tonelli
Information and Software Technology, vol. 189, Article 107917, published 2025-10-10. DOI: 10.1016/j.infsof.2025.107917
Context:
As blockchain systems grow in complexity, secure and efficient smart contract development remains a crucial challenge. Large Language Models (LLMs) like DeepSeek promise significant enhancements in developer productivity through automated code generation, debugging, and testing. This study focuses on Solidity, the dominant language for Ethereum smart contracts, where correctness, gas efficiency, and security are critical to real-world adoption.
Objective:
This study evaluates the capabilities of DeepSeek’s V3 model, a non-reasoning Mixture-of-Experts architecture, and its R1 model, a reasoning model trained via reinforcement learning, in automating Solidity contract generation and testing, as well as in identifying and fixing common vulnerabilities.
Methods:
We designed a controlled experimental framework to evaluate both models by generating and analysing a diverse set of smart contracts, including standardised tokens (ERC20, ERC721, ERC1155) and real-world application scenarios (Supply Chain, Token Exchange, Auction). The evaluation is grounded in a multidimensional metric suite covering quality, technical robustness, and process characteristics. Vulnerability detection and patching capabilities are tested using predefined vulnerable contracts and guided patch prompts. The analysis spans six levels of prompt complexity and compares the impact of reasoning-based and non-reasoning-based generation strategies.
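The experimental loop described above can be sketched roughly as follows. This is a minimal illustration only: the contract list, the six prompt-complexity levels, and all function and metric names are assumptions, not the authors' actual implementation.

```python
from dataclasses import dataclass

# Hypothetical sketch of the evaluation framework: every (contract type,
# prompt-complexity level) pair is sent to one model, and the output is
# compiled and tested. Names and metrics are invented for illustration.

CONTRACT_TYPES = ["ERC20", "ERC721", "ERC1155",
                  "SupplyChain", "TokenExchange", "Auction"]
PROMPT_LEVELS = range(1, 7)  # six levels of prompt complexity

@dataclass
class Result:
    model: str
    contract: str
    level: int
    compiles: bool
    tests_passed: int
    gas_used: int

def evaluate(model_name, generate_fn, compile_fn, test_fn):
    """Run every (contract, level) pair through one model and collect metrics.

    generate_fn(contract, level) -> Solidity source produced by the LLM
    compile_fn(source)           -> bool (e.g. a solc compilation check)
    test_fn(source)              -> (tests_passed, gas_used)
    """
    results = []
    for contract in CONTRACT_TYPES:
        for level in PROMPT_LEVELS:
            source = generate_fn(contract, level)
            ok = compile_fn(source)
            passed, gas = test_fn(source) if ok else (0, 0)
            results.append(Result(model_name, contract, level, ok, passed, gas))
    return results
```

A real harness would plug in the model API for `generate_fn` and a Solidity toolchain for `compile_fn`/`test_fn`; the point here is only the shape of the 6-contracts-by-6-levels grid.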
Results:
Findings reveal that R1 delivers more accurate and better-optimised outputs under high complexity, while V3 performs more consistently on simpler tasks with less intricate code structures. However, both models exhibit persistent hallucinations, limited vulnerability coverage, and inconsistencies arising from prompt formulation. The correlation between re-evaluation patterns and output quality suggests that reasoning helps in complex scenarios, although excessive revision may lead to over-engineered or unstable solutions.
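The correlation between re-evaluation patterns and output quality mentioned above can be illustrated with a plain Pearson coefficient. The data points below are invented for illustration; the abstract does not report the authors' actual values or estimator.

```python
from statistics import mean

def pearson(xs, ys):
    """Plain Pearson correlation coefficient over two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Invented example: quality rises with moderate numbers of re-evaluation
# passes but drops again when revision becomes excessive.
revisions = [0, 1, 2, 3, 5, 8]                     # re-evaluation passes per task
quality   = [0.55, 0.62, 0.71, 0.78, 0.80, 0.66]   # hypothetical quality scores

r = pearson(revisions, quality)  # moderate positive correlation overall
```

On this toy data the correlation is positive but well below 1, consistent with the claim that reasoning helps up to a point while excessive revision erodes quality.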
Conclusions:
Neither model is robust enough to autonomously generate issue-free smart contracts in complex or security-critical scenarios, underscoring the need for human oversight. These findings highlight best practices for integrating LLMs into blockchain development workflows and emphasise the importance of aligning model selection with task complexity and security requirements.
Journal overview:
Information and Software Technology is the international archival journal focusing on research and experience that contributes to the improvement of software development practices. The journal's scope includes methods and techniques to better engineer software and manage its development. Articles submitted for review should have a clear component of software engineering or address ways to improve the engineering and management of software development. Areas covered by the journal include:
• Software management, quality and metrics
• Software processes
• Software architecture, modelling, specification, design and programming
• Functional and non-functional software requirements
• Software testing and verification & validation
• Empirical studies of all aspects of engineering and managing software development
Short Communications is a new section dedicated to short papers addressing new ideas, controversial opinions, "Negative" results and much more. Read the Guide for authors for more information.
The journal encourages and welcomes submissions of systematic literature studies (reviews and maps) within the scope of the journal. Information and Software Technology is the premier outlet for systematic literature studies in software engineering.