RL Perceptron: Generalization Dynamics of Policy Learning in High Dimensions

IF 11.6 · Tier 1 (Physics & Astronomy) · Q1, Physics, Multidisciplinary
Nishil Patel, Sebastian Lee, Stefano Sarao Mannelli, Sebastian Goldt, Andrew Saxe
{"title":"RL感知器:高维策略学习的泛化动力学","authors":"Nishil Patel, Sebastian Lee, Stefano Sarao Mannelli, Sebastian Goldt, Andrew Saxe","doi":"10.1103/physrevx.15.021051","DOIUrl":null,"url":null,"abstract":"Reinforcement learning (RL) algorithms have transformed many domains of machine learning. To tackle real-world problems, RL often relies on neural networks to learn policies directly from pixels or other high-dimensional sensory input. By contrast, many theories of RL have focused on discrete state spaces or worst-case analysis, and fundamental questions remain about the dynamics of policy learning in high-dimensional settings. Here, we propose a solvable high-dimensional RL model that can capture a variety of learning protocols, and we derive its typical policy learning dynamics as a set of closed-form ordinary differential equations. We obtain optimal schedules for the learning rates and task difficulty—analogous to annealing schemes and curricula during training in RL—and show that the model exhibits rich behavior, including delayed learning under sparse rewards, a variety of learning regimes depending on reward baselines, and a speed-accuracy trade-off driven by reward stringency. Experiments on variants of the Procgen game “Bossfight” and Arcade Learning Environment game “Pong” also show such a speed-accuracy trade-off in practice. Together, these results take a step toward closing the gap between theory and practice in high-dimensional RL. <jats:supplementary-material> <jats:copyright-statement>Published by the American Physical Society</jats:copyright-statement> <jats:copyright-year>2025</jats:copyright-year> </jats:permissions> </jats:supplementary-material>","PeriodicalId":20161,"journal":{"name":"Physical Review X","volume":"96 1","pages":""},"PeriodicalIF":11.6000,"publicationDate":"2025-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"RL Perceptron: Generalization Dynamics of Policy Learning in High Dimensions\",\"authors\":\"Nishil Patel, Sebastian Lee, Stefano Sarao Mannelli, Sebastian Goldt, Andrew Saxe\",\"doi\":\"10.1103/physrevx.15.021051\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Reinforcement learning (RL) algorithms have transformed many domains of machine learning. To tackle real-world problems, RL often relies on neural networks to learn policies directly from pixels or other high-dimensional sensory input. By contrast, many theories of RL have focused on discrete state spaces or worst-case analysis, and fundamental questions remain about the dynamics of policy learning in high-dimensional settings. Here, we propose a solvable high-dimensional RL model that can capture a variety of learning protocols, and we derive its typical policy learning dynamics as a set of closed-form ordinary differential equations. We obtain optimal schedules for the learning rates and task difficulty—analogous to annealing schemes and curricula during training in RL—and show that the model exhibits rich behavior, including delayed learning under sparse rewards, a variety of learning regimes depending on reward baselines, and a speed-accuracy trade-off driven by reward stringency. Experiments on variants of the Procgen game “Bossfight” and Arcade Learning Environment game “Pong” also show such a speed-accuracy trade-off in practice. Together, these results take a step toward closing the gap between theory and practice in high-dimensional RL. 
<jats:supplementary-material> <jats:copyright-statement>Published by the American Physical Society</jats:copyright-statement> <jats:copyright-year>2025</jats:copyright-year> </jats:permissions> </jats:supplementary-material>\",\"PeriodicalId\":20161,\"journal\":{\"name\":\"Physical Review X\",\"volume\":\"96 1\",\"pages\":\"\"},\"PeriodicalIF\":11.6000,\"publicationDate\":\"2025-05-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Physical Review X\",\"FirstCategoryId\":\"101\",\"ListUrlMain\":\"https://doi.org/10.1103/physrevx.15.021051\",\"RegionNum\":1,\"RegionCategory\":\"物理与天体物理\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"PHYSICS, MULTIDISCIPLINARY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Physical Review X","FirstCategoryId":"101","ListUrlMain":"https://doi.org/10.1103/physrevx.15.021051","RegionNum":1,"RegionCategory":"物理与天体物理","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"PHYSICS, MULTIDISCIPLINARY","Score":null,"Total":0}
Citations: 0

Abstract

Reinforcement learning (RL) algorithms have transformed many domains of machine learning. To tackle real-world problems, RL often relies on neural networks to learn policies directly from pixels or other high-dimensional sensory input. By contrast, many theories of RL have focused on discrete state spaces or worst-case analysis, and fundamental questions remain about the dynamics of policy learning in high-dimensional settings. Here, we propose a solvable high-dimensional RL model that can capture a variety of learning protocols, and we derive its typical policy learning dynamics as a set of closed-form ordinary differential equations. We obtain optimal schedules for the learning rates and task difficulty—analogous to annealing schemes and curricula during training in RL—and show that the model exhibits rich behavior, including delayed learning under sparse rewards, a variety of learning regimes depending on reward baselines, and a speed-accuracy trade-off driven by reward stringency. Experiments on variants of the Procgen game “Bossfight” and Arcade Learning Environment game “Pong” also show such a speed-accuracy trade-off in practice. Together, these results take a step toward closing the gap between theory and practice in high-dimensional RL.
Published by the American Physical Society, 2025
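The abstract describes the model only at a high level, so as a rough illustration the sketch below simulates a toy perceptron-style policy trained from sparse, episodic binary rewards in a high-dimensional teacher-student setup. Everything concrete in it (the all-or-nothing reward criterion, the reward-gated Hebbian update, and the hyperparameters N, T, lr) is an assumption made for this example; it is not the paper's actual model, update rule, or ODE derivation.

```python
# Minimal, assumption-laden sketch: a high-dimensional "policy perceptron" that only
# receives a single binary reward per episode, loosely in the spirit of the setup the
# abstract describes. Not the paper's model or equations.
import numpy as np

rng = np.random.default_rng(0)

N = 500                 # input dimension (assumed)
T = 5                   # decisions per episode (assumed)
episodes = 2000         # number of training episodes (assumed)
lr = 0.5 / np.sqrt(N)   # learning-rate scaling (assumed)

w_teacher = rng.standard_normal(N)   # ground-truth policy direction
w_student = np.zeros(N)              # learned policy weights


def overlap(w):
    """Cosine similarity with the teacher: a simple proxy for generalization."""
    norm = np.linalg.norm(w)
    return 0.0 if norm == 0.0 else float(w @ w_teacher) / (norm * np.linalg.norm(w_teacher))


for ep in range(episodes):
    X = rng.standard_normal((T, N))                        # i.i.d. Gaussian "states"
    a_student = np.where(X @ w_student >= 0, 1.0, -1.0)    # student's binary actions
    a_teacher = np.where(X @ w_teacher >= 0, 1.0, -1.0)    # "correct" actions
    # Sparse episodic reward: +1 only if every decision in the episode was correct
    # (an assumed, deliberately strict criterion to mimic sparse rewards).
    reward = 1.0 if np.all(a_student == a_teacher) else 0.0
    # Reward-gated Hebbian / REINFORCE-style update, applied once per episode.
    w_student += lr * reward * (a_student[:, None] * X).sum(axis=0) / T

print(f"final teacher-student overlap: {overlap(w_student):.3f}")
```

Tracking the teacher-student overlap over episodes in such a toy, for instance how long it stays flat before taking off, is one concrete way to see the kind of delayed learning under sparse rewards that the abstract refers to.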
Source journal
Physical Review X (Physics, Multidisciplinary)
CiteScore: 24.60
Self-citation rate: 1.60%
Articles published: 197
Review time: 3 months
About the journal: Physical Review X (PRX) stands as an exclusively online, fully open-access journal, emphasizing innovation, quality, and enduring impact in the scientific content it disseminates. Devoted to showcasing a curated selection of papers from pure, applied, and interdisciplinary physics, PRX aims to feature work with the potential to shape current and future research while leaving a lasting and profound impact in their respective fields. Encompassing the entire spectrum of physics subject areas, PRX places a special focus on groundbreaking interdisciplinary research with broad-reaching influence.