Investigating the bugs in reinforcement learning programs: Insights from Stack Overflow and GitHub

IF 3.1 | Zone 2 (Computer Science) | JCR Q3 (COMPUTER SCIENCE, SOFTWARE ENGINEERING)

Automated Software Engineering | Pub Date: 2025-09-23 | DOI: 10.1007/s10515-025-00555-z

Jiayin Song, Yike Li, Yunzhe Tian, Haoxuan Ma, Honglei Li, Jie Zuo, Jiqiang Liu, Wenjia Niu


Citations: 0

Abstract

Reinforcement learning (RL) is increasingly applied in areas such as gaming, robotic control, and autonomous driving. Like deep learning systems, RL systems encounter failures during operation. However, RL differs from deep learning in its error causes and symptom manifestations. What are the differences in error causes and symptoms between RL and deep learning? How are RL errors and their symptoms related? Understanding the symptoms and causes of RL failures can advance research on RL failure detection and repair. In this paper, we conducted a comprehensive empirical study by collecting 1,155 error reports from the popular Q&A forum Stack Overflow and four GitHub repositories: baselines, stable-baselines3, tianshou, and keras-rl. We analyzed the root causes and symptoms of these failures and examined the differences in resolution times across root causes. Additionally, we analyzed the correlations between causes and symptoms. Our study yielded 14 key findings and six implications for developing RL failure detection and repair tools. Our work is the first to integrate LLM-based analysis with manual validation for RL bug studies, providing actionable insights for tool development and testing strategies.


Source journal

Automated Software Engineering (Engineering & Technology: Computer Science, Software Engineering)

CiteScore

4.80

Self-citation rate

11.80%

Articles published

Review time

>12 weeks

Journal description: This journal publishes research, tutorial papers, surveys, and accounts of significant industrial experience in the foundations, techniques, tools, and applications of automated software engineering technology. This includes the study of techniques for constructing, understanding, adapting, and modeling software artifacts and processes. Coverage in Automated Software Engineering examines both automatic systems and collaborative systems as well as computational models of human software engineering activities. In addition, it presents knowledge representations and artificial intelligence techniques applicable to automated software engineering, and formal techniques that support or provide theoretical foundations. The journal also includes reviews of books, software, conferences, and workshops.