Hyperparameter Tuning in Offline Reinforcement Learning

Andrew Tittaferrante, A. Yassine
{"title":"离线强化学习中的超参数调优","authors":"Andrew Tittaferrante, A. Yassine","doi":"10.1109/ICMLA55696.2022.00101","DOIUrl":null,"url":null,"abstract":"In this work, we propose a reliable hyperparameter tuning scheme for offline reinforcement learning. We demonstrate our proposed scheme using the simplest antmaze environment from the standard benchmark offline dataset, D4RL. The usual approach for policy evaluation in offline reinforcement learning involves online evaluation, i.e., cherry-picking best performance on the test environment. To mitigate this cherry-picking, we propose an ad-hoc online evaluation metric, which we name \"median-median-return\". This metric enables more reliable reporting of results because it represents the expected performance of the learned policy by taking the median online evaluation performance across both epochs and training runs. To demonstrate our scheme, we employ the recently state-of-the-art algorithm, IQL, and perform a thorough hyperparameter search based on our proposed metric. The tuned architectures enjoy notably stronger cherry-picked performance, and the best models are able to surpass the reported state-of-the-art performance on average.","PeriodicalId":128160,"journal":{"name":"2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Hyperparameter Tuning in Offline Reinforcement Learning\",\"authors\":\"Andrew Tittaferrante, A. Yassine\",\"doi\":\"10.1109/ICMLA55696.2022.00101\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this work, we propose a reliable hyperparameter tuning scheme for offline reinforcement learning. We demonstrate our proposed scheme using the simplest antmaze environment from the standard benchmark offline dataset, D4RL. The usual approach for policy evaluation in offline reinforcement learning involves online evaluation, i.e., cherry-picking best performance on the test environment. To mitigate this cherry-picking, we propose an ad-hoc online evaluation metric, which we name \\\"median-median-return\\\". This metric enables more reliable reporting of results because it represents the expected performance of the learned policy by taking the median online evaluation performance across both epochs and training runs. To demonstrate our scheme, we employ the recently state-of-the-art algorithm, IQL, and perform a thorough hyperparameter search based on our proposed metric. 
The tuned architectures enjoy notably stronger cherry-picked performance, and the best models are able to surpass the reported state-of-the-art performance on average.\",\"PeriodicalId\":128160,\"journal\":{\"name\":\"2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA)\",\"volume\":\"42 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICMLA55696.2022.00101\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICMLA55696.2022.00101","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

In this work, we propose a reliable hyperparameter tuning scheme for offline reinforcement learning. We demonstrate our proposed scheme using the simplest antmaze environment from the standard benchmark offline dataset, D4RL. The usual approach for policy evaluation in offline reinforcement learning involves online evaluation, i.e., cherry-picking the best performance on the test environment. To mitigate this cherry-picking, we propose an ad-hoc online evaluation metric, which we name "median-median-return". This metric enables more reliable reporting of results because it represents the expected performance of the learned policy by taking the median online evaluation performance across both epochs and training runs. To demonstrate our scheme, we employ the recent state-of-the-art algorithm, IQL, and perform a thorough hyperparameter search based on our proposed metric. The tuned architectures enjoy notably stronger cherry-picked performance, and the best models are able to surpass the reported state-of-the-art performance on average.
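
As a rough illustration, the sketch below shows one plausible reading of the median-median-return metric described in the abstract: take the median return across evaluation epochs within each training run, then the median of those per-run values across runs. The function name, array shape, and NumPy-based implementation are assumptions for illustration only, not the authors' reference code.

```python
import numpy as np

def median_median_return(returns):
    """Median-median-return over a (num_runs, num_epochs) array of
    online evaluation returns.

    One plausible reading of the metric from the abstract: first take
    the median across evaluation epochs within each training run, then
    the median of those per-run values across runs. Illustrative sketch
    only, not the authors' implementation.
    """
    returns = np.asarray(returns, dtype=float)
    per_run_median = np.median(returns, axis=1)  # median over epochs, one value per run
    return np.median(per_run_median)             # median over training runs

# Hypothetical example: 3 training runs, each evaluated at 4 epochs.
evaluation_returns = [
    [0.2, 0.5, 0.6, 0.4],
    [0.1, 0.3, 0.7, 0.5],
    [0.4, 0.6, 0.8, 0.6],
]
print(median_median_return(evaluation_returns))  # -> 0.45
```

Compared with reporting the single best epoch of the single best run, aggregating by medians over both axes is less sensitive to lucky evaluation episodes and outlier runs, which is the reliability argument the abstract makes for the metric.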