Empirical analysis of the convergence of Double DQN in relation to reward sparsity

Samuel Blad, Martin Längkvist, Franziska Klügl-Frohnmeyer, A. Loutfi
{"title":"双DQN收敛性与奖励稀疏性的实证分析","authors":"Samuel Blad, Martin Längkvist, Franziska Klügl-Frohnmeyer, A. Loutfi","doi":"10.1109/ICMLA55696.2022.00102","DOIUrl":null,"url":null,"abstract":"Q-Networks are used in Reinforcement Learning to model the expected return from every action at a given state. When training Q-Networks, external reward signals are propagated to the previously performed actions leading up to each reward. If many actions are required before experiencing a reward, the reward signal is distributed across all those actions, where some actions may have greater impact on the reward than others. As the number of significant actions between rewards increases, the relative importance of each action decreases. If actions have too small importance, their impact might be overshadowed by noise in a deep neural network model, potentially causing convergence issues. In this work, we empirically test the limits of increasing the number of actions leading up to a reward in a simple grid-world environment. We show in our experiments that even though the training error surpasses the reward signal attributed to each action, the model is still able to learn a smooth enough value representation.","PeriodicalId":128160,"journal":{"name":"2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA)","volume":"78 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Empirical analysis of the convergence of Double DQN in relation to reward sparsity\",\"authors\":\"Samuel Blad, Martin Längkvist, Franziska Klügl-Frohnmeyer, A. Loutfi\",\"doi\":\"10.1109/ICMLA55696.2022.00102\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Q-Networks are used in Reinforcement Learning to model the expected return from every action at a given state. When training Q-Networks, external reward signals are propagated to the previously performed actions leading up to each reward. If many actions are required before experiencing a reward, the reward signal is distributed across all those actions, where some actions may have greater impact on the reward than others. As the number of significant actions between rewards increases, the relative importance of each action decreases. If actions have too small importance, their impact might be overshadowed by noise in a deep neural network model, potentially causing convergence issues. In this work, we empirically test the limits of increasing the number of actions leading up to a reward in a simple grid-world environment. 
We show in our experiments that even though the training error surpasses the reward signal attributed to each action, the model is still able to learn a smooth enough value representation.\",\"PeriodicalId\":128160,\"journal\":{\"name\":\"2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA)\",\"volume\":\"78 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICMLA55696.2022.00102\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICMLA55696.2022.00102","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

Q-Networks are used in Reinforcement Learning to model the expected return from every action at a given state. When training Q-Networks, external reward signals are propagated to the previously performed actions leading up to each reward. If many actions are required before experiencing a reward, the reward signal is distributed across all those actions, where some actions may have greater impact on the reward than others. As the number of significant actions between rewards increases, the relative importance of each action decreases. If actions have too small importance, their impact might be overshadowed by noise in a deep neural network model, potentially causing convergence issues. In this work, we empirically test the limits of increasing the number of actions leading up to a reward in a simple grid-world environment. We show in our experiments that even though the training error surpasses the reward signal attributed to each action, the model is still able to learn a smooth enough value representation.
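
To make the credit-assignment issue concrete, the sketch below applies a Double-DQN-style update in a one-dimensional chain "grid world" where the only reward sits at the far end, so the number of actions between the start state and the reward can be varied. This is a minimal illustration, not the authors' implementation: the chain environment, the tabular value storage, and all hyperparameters (N_STATES, GAMMA, ALPHA, EPISODES) are assumptions made here for clarity. The Double DQN target is y = r + γ · Q_target(s', argmax_a Q_online(s', a)), and the value credited to the earliest action decays roughly as γ^N with the number N of actions preceding the reward, which is the quantity the paper compares against the noise of a deep network.

import numpy as np

# Minimal illustrative sketch (not the authors' code): a tabular Double-DQN-style
# update on a 1-D chain "grid world". The only reward sits at the right end of
# the chain, so N_STATES controls how many actions separate the start state
# from the reward -- the reward sparsity the paper varies.

N_STATES = 20     # chain length (assumed value; the paper sweeps this quantity)
GAMMA = 0.99      # discount factor
ALPHA = 0.1       # learning rate
EPISODES = 500
MAX_STEPS = 2000  # safety cap per episode

q_online = np.zeros((N_STATES, 2))   # action 0 = left, action 1 = right
q_target = np.zeros((N_STATES, 2))   # periodically synced copy

def step(s, a):
    """Deterministic chain dynamics: reward 1.0 only on reaching the last state."""
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, reward, s_next == N_STATES - 1

rng = np.random.default_rng(0)
for ep in range(EPISODES):
    s, done = 0, False
    for _ in range(MAX_STEPS):
        a = int(rng.integers(2))             # uniform random behaviour policy
        s_next, r, done = step(s, a)
        # Double DQN target: the online table selects the next action,
        # the target table evaluates it.
        a_star = int(np.argmax(q_online[s_next]))
        y = r if done else r + GAMMA * q_target[s_next, a_star]
        q_online[s, a] += ALPHA * (y - q_online[s, a])
        s = s_next
        if done:
            break
    if ep % 20 == 0:
        q_target = q_online.copy()           # sync the target table

# The value propagated back to the very first action shrinks roughly as
# GAMMA ** (N_STATES - 1); the paper asks whether such a small signal survives
# the training noise once a neural network replaces these tables.
print("Q at start state:", q_online[0], "  gamma^(N-1):", GAMMA ** (N_STATES - 1))

Replacing the tables with a deep network, whose approximation and training noise can exceed the per-action reward contribution, is what distinguishes the setting studied in the paper from this toy version.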