LLMs and Stack Overflow discussions: Reliability, impact, and challenges

IF 4.1 | Zone 2, Computer Science | JCR Q1, COMPUTER SCIENCE, SOFTWARE ENGINEERING

Journal of Systems and Software | Pub Date: 2025-07-03 | DOI: 10.1016/j.jss.2025.112541

Leuson Da Silva, Jordan Samhi, Foutse Khomh

Journal of Systems and Software, volume 230, Article 112541. Not open access. Publisher page: https://www.sciencedirect.com/science/article/pii/S0164121225002109

Citations: 0

Abstract


LLMs and Stack Overflow discussions: Reliability, impact, and challenges

Since its release in November 2022, ChatGPT has shaken up Stack Overflow, the premier platform for developers’ queries on programming and software development. Demonstrating an ability to generate instant, human-like responses to technical questions, ChatGPT has ignited debates within the developer community about the evolving role of human-driven platforms in the age of generative AI. Two months after ChatGPT’s release, Meta released its answer with its own Large Language Model (LLM) called LLaMA: the race was on. We conducted an empirical study analyzing questions from Stack Overflow and using these LLMs to address them. This way, we aim to quantify the reliability of LLMs’ answers and their potential to replace Stack Overflow in the long term; identify and understand why LLMs fail; measure users’ activity evolution with Stack Overflow over time; and compare LLMs together. Our empirical results are unequivocal: ChatGPT and LLaMA challenge human expertise, yet do not outperform it for some domains, while a significant decline in user posting activity has been observed. Furthermore, we also discuss the impact of our findings regarding the usage and development of new LLMs and provide guidelines for future challenges faced by users and researchers.


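Source journal

Journal of Systems and Software (Engineering & Technology, Computer Science: Theory & Methods)

CiteScore: 8.60
Self-citation rate: 5.70%
Articles published: 193
Review time: 16 weeks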

About the journal: The Journal of Systems and Software publishes papers covering all aspects of software engineering and related hardware-software-systems issues. All articles should include a validation of the idea presented, e.g. through case studies, experiments, or systematic comparisons with other approaches already in practice. Topics of interest include, but are not limited to:
• Methods and tools for, and empirical studies on, software requirements, design, architecture, verification and validation, maintenance and evolution
• Agile, model-driven, service-oriented, open source and global software development
• Approaches for mobile, multiprocessing, real-time, distributed, cloud-based, dependable and virtualized systems
• Human factors and management concerns of software development
• Data management and big data issues of software systems
• Metrics and evaluation, data mining of software development resources
• Business and economic aspects of software development processes
The journal welcomes state-of-the-art surveys and reports of practical experience for all of these topics.