通过Maxmean和Aitken值迭代减少强化学习中的估计偏差和方差

IF 8 2区计算机科学 Q1 AUTOMATION & CONTROL SYSTEMS

Engineering Applications of Artificial Intelligence Pub Date : 2025-10-04 DOI:10.1016/j.engappai.2025.112502

Fanghui Huang , Wenqi Han , Xiang Li , Xinyang Deng , Wen Jiang

{"title":"通过Maxmean和Aitken值迭代减少强化学习中的估计偏差和方差","authors":"Fanghui Huang , Wenqi Han , Xiang Li , Xinyang Deng , Wen Jiang","doi":"10.1016/j.engappai.2025.112502","DOIUrl":null,"url":null,"abstract":"<div><div>The value-based reinforcement leaning methods suffer from overestimation bias, because of the existence of max operator, resulting in suboptimal policies. Meanwhile, variance in value estimation will cause the instability of networks. Many algorithms have been presented to solve the mentioned, but these lack the theoretical analysis about the degree of estimation bias, and the trade-off between the estimation bias and variance. Motivated by the above, in this paper, we propose a novel method based on Maxmean and Aitken value iteration, named MMAVI. The Maxmean operation allows the average of multiple state–action values (Q values) to be used as the estimated target value to mitigate the bias and variance. The Aitken value iteration is used to update Q values and improve the convergence rate. Based on the proposed method, combined with Q-learning and deep Q-network, we design two novel algorithms to adapt to different environments. To understand the effect of MMAVI, we analyze it both theoretically and empirically. In theory, we derive the closed-form expressions of reducing bias and variance, and prove that the convergence rate of our proposed method is faster than the traditional methods with Bellman equation. In addition, the convergence of our algorithms is proved in a tabular setting. Finally, we demonstrate that our proposed algorithms outperform the state-of-the-art algorithms in several environments.</div></div>","PeriodicalId":50523,"journal":{"name":"Engineering Applications of Artificial Intelligence","volume":"162 ","pages":"Article 112502"},"PeriodicalIF":8.0000,"publicationDate":"2025-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Reducing the estimation bias and variance in reinforcement learning via Maxmean and Aitken value iteration\",\"authors\":\"Fanghui Huang , Wenqi Han , Xiang Li , Xinyang Deng , Wen Jiang\",\"doi\":\"10.1016/j.engappai.2025.112502\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>The value-based reinforcement leaning methods suffer from overestimation bias, because of the existence of max operator, resulting in suboptimal policies. Meanwhile, variance in value estimation will cause the instability of networks. Many algorithms have been presented to solve the mentioned, but these lack the theoretical analysis about the degree of estimation bias, and the trade-off between the estimation bias and variance. Motivated by the above, in this paper, we propose a novel method based on Maxmean and Aitken value iteration, named MMAVI. The Maxmean operation allows the average of multiple state–action values (Q values) to be used as the estimated target value to mitigate the bias and variance. The Aitken value iteration is used to update Q values and improve the convergence rate. Based on the proposed method, combined with Q-learning and deep Q-network, we design two novel algorithms to adapt to different environments. To understand the effect of MMAVI, we analyze it both theoretically and empirically. In theory, we derive the closed-form expressions of reducing bias and variance, and prove that the convergence rate of our proposed method is faster than the traditional methods with Bellman equation. In addition, the convergence of our algorithms is proved in a tabular setting. Finally, we demonstrate that our proposed algorithms outperform the state-of-the-art algorithms in several environments.</div></div>\",\"PeriodicalId\":50523,\"journal\":{\"name\":\"Engineering Applications of Artificial Intelligence\",\"volume\":\"162 \",\"pages\":\"Article 112502\"},\"PeriodicalIF\":8.0000,\"publicationDate\":\"2025-10-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Engineering Applications of Artificial Intelligence\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0952197625025333\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"AUTOMATION & CONTROL SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Engineering Applications of Artificial Intelligence","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0952197625025333","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}

引用次数: 0

摘要

基于值的强化学习方法由于存在最大算子，存在高估偏差，导致策略次优。同时，估计值的方差会导致网络的不稳定。已有许多算法解决了上述问题，但缺乏对估计偏差程度的理论分析，以及估计偏差与方差之间的权衡。基于此，本文提出了一种基于Maxmean和Aitken值迭代的新方法——MMAVI。Maxmean操作允许使用多个状态-动作值（Q值）的平均值作为估计的目标值，以减轻偏差和方差。采用艾特肯值迭代法更新Q值，提高收敛速度。在此基础上，结合q学习和深度q网络，设计了两种适应不同环境的新算法。为了理解MMAVI的作用，我们从理论和实证两个方面进行了分析。在理论上，我们推导出了减少偏差和方差的封闭表达式，并证明了该方法的收敛速度比传统的Bellman方程方法快。此外，在表格设置下证明了算法的收敛性。最后，我们证明了我们提出的算法在几个环境中优于最先进的算法。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Reducing the estimation bias and variance in reinforcement learning via Maxmean and Aitken value iteration

The value-based reinforcement leaning methods suffer from overestimation bias, because of the existence of max operator, resulting in suboptimal policies. Meanwhile, variance in value estimation will cause the instability of networks. Many algorithms have been presented to solve the mentioned, but these lack the theoretical analysis about the degree of estimation bias, and the trade-off between the estimation bias and variance. Motivated by the above, in this paper, we propose a novel method based on Maxmean and Aitken value iteration, named MMAVI. The Maxmean operation allows the average of multiple state–action values (Q values) to be used as the estimated target value to mitigate the bias and variance. The Aitken value iteration is used to update Q values and improve the convergence rate. Based on the proposed method, combined with Q-learning and deep Q-network, we design two novel algorithms to adapt to different environments. To understand the effect of MMAVI, we analyze it both theoretically and empirically. In theory, we derive the closed-form expressions of reducing bias and variance, and prove that the convergence rate of our proposed method is faster than the traditional methods with Bellman equation. In addition, the convergence of our algorithms is proved in a tabular setting. Finally, we demonstrate that our proposed algorithms outperform the state-of-the-art algorithms in several environments.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Engineering Applications of Artificial Intelligence 工程技术-工程：电子与电气

CiteScore

9.60

自引率

10.00%

发文量

505

审稿时长

68 days

期刊介绍： Artificial Intelligence (AI) is pivotal in driving the fourth industrial revolution, witnessing remarkable advancements across various machine learning methodologies. AI techniques have become indispensable tools for practicing engineers, enabling them to tackle previously insurmountable challenges. Engineering Applications of Artificial Intelligence serves as a global platform for the swift dissemination of research elucidating the practical application of AI methods across all engineering disciplines. Submitted papers are expected to present novel aspects of AI utilized in real-world engineering applications, validated using publicly available datasets to ensure the replicability of research outcomes. Join us in exploring the transformative potential of AI in engineering.