Bio-inspired meta-learning for active exploration during non-stationary multi-armed bandit tasks

G. Velentzas, C. Tzafestas, M. Khamassi
{"title":"Bio-inspired meta-learning for active exploration during non-stationary multi-armed bandit tasks","authors":"G. Velentzas, C. Tzafestas, M. Khamassi","doi":"10.1109/INTELLISYS.2017.8324365","DOIUrl":null,"url":null,"abstract":"Fast adaptation to changes in the environment requires agents (animals, robots and simulated artefacts) to be able to dynamically tune an exploration-exploitation trade-off during learning. This trade-off usually determines a fixed proportion of exploitative choices (i.e. choice of the action that subjectively appears as best at a given moment) relative to exploratory choices (i.e. testing other actions that now appear worst but may turn out promising later). Rather than using a fixed proportion, non-stationary multi-armed bandit methods in the field of machine learning have proven that principles such as exploring actions that have not been tested for a long time can lead to performance closer to optimal — bounded regret. In parallel, researches in active exploration in the fields of robot learning and computational neuroscience of learning and decision-making have proposed alternative solutions such as transiently increasing exploration in response to drops in average performance, or attributing exploration bonuses specifically to actions associated with high uncertainty in order to gain information when choosing them. In this work, we compare different methods from machine learning, computational neuroscience and robot learning on a set of non-stationary stochastic multi-armed bandit tasks: abrupt shifts; best bandit becomes worst one and vice versa; multiple shifting frequencies. We find that different methods are appropriate in different scenarios. We propose a new hybrid method combining bio-inspired meta-learning, kalman filter and exploration bonuses and show that it outperforms other methods in these scenarios.","PeriodicalId":131825,"journal":{"name":"2017 Intelligent Systems Conference (IntelliSys)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 Intelligent Systems Conference (IntelliSys)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/INTELLISYS.2017.8324365","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

Fast adaptation to changes in the environment requires agents (animals, robots and simulated artefacts) to be able to dynamically tune an exploration-exploitation trade-off during learning. This trade-off usually determines a fixed proportion of exploitative choices (i.e. choosing the action that subjectively appears best at a given moment) relative to exploratory choices (i.e. testing other actions that currently appear worse but may turn out promising later). Rather than using a fixed proportion, non-stationary multi-armed bandit methods in the field of machine learning have proven that principles such as exploring actions that have not been tested for a long time can lead to performance closer to optimal, with bounded regret. In parallel, research on active exploration in the fields of robot learning and computational neuroscience of learning and decision-making has proposed alternative solutions, such as transiently increasing exploration in response to drops in average performance, or attributing exploration bonuses specifically to actions associated with high uncertainty in order to gain information when choosing them. In this work, we compare different methods from machine learning, computational neuroscience and robot learning on a set of non-stationary stochastic multi-armed bandit tasks: abrupt shifts; the best bandit becoming the worst one and vice versa; multiple shifting frequencies. We find that different methods are appropriate in different scenarios. We propose a new hybrid method combining bio-inspired meta-learning, a Kalman filter and exploration bonuses, and show that it outperforms other methods in these scenarios.
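The abstract names the ingredients of the hybrid method (bio-inspired meta-learning, a Kalman filter and exploration bonuses) without giving the update rules, so the following Python sketch is an illustration under our own assumptions rather than the authors' algorithm: each arm's drifting mean reward is tracked by a scalar Kalman filter whose posterior variance supplies an uncertainty bonus, and the softmax inverse temperature is meta-learned by comparing fast and slow running averages of reward, so that a drop in performance transiently boosts exploration. All parameter names and values (obs_var, process_var, bonus_weight, eta) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)


class KalmanArm:
    """Scalar Kalman filter tracking one arm's (possibly drifting) mean reward."""

    def __init__(self, obs_var=0.1, process_var=0.01):
        self.mean = 0.0                  # posterior mean of the arm's value
        self.var = 1.0                   # posterior variance (uncertainty)
        self.obs_var = obs_var           # assumed reward observation noise
        self.process_var = process_var   # drift noise: uncertainty grows over time

    def predict(self):
        # Non-stationarity: variance grows on every step, even for unpulled arms.
        self.var += self.process_var

    def update(self, reward):
        # Standard scalar Kalman gain and posterior update.
        gain = self.var / (self.var + self.obs_var)
        self.mean += gain * (reward - self.mean)
        self.var *= (1.0 - gain)


def softmax(x, beta):
    z = beta * (x - np.max(x))
    p = np.exp(z)
    return p / p.sum()


def run_hybrid_agent(arm_means, n_steps=2000, swap_every=500,
                     bonus_weight=1.0, beta0=2.0, eta=0.1):
    """Kalman value tracking + uncertainty bonus + meta-learned softmax temperature."""
    n_arms = len(arm_means)
    arms = [KalmanArm() for _ in range(n_arms)]
    beta = beta0                         # inverse temperature (higher = more exploitative)
    r_short, r_long = 0.0, 0.0           # fast and slow running averages of reward
    means = np.array(arm_means, dtype=float)
    total = 0.0

    for t in range(n_steps):
        if t > 0 and t % swap_every == 0:
            means = means[::-1].copy()   # abrupt shift: best arm becomes worst and vice versa

        for arm in arms:
            arm.predict()
        values = np.array([a.mean + bonus_weight * np.sqrt(a.var) for a in arms])
        choice = rng.choice(n_arms, p=softmax(values, beta))
        reward = rng.normal(means[choice], 0.3)
        arms[choice].update(reward)
        total += reward

        # Meta-learning of beta: when the fast reward average falls below the slow one
        # (a performance drop), beta decreases, which transiently increases exploration.
        r_short += 0.2 * (reward - r_short)
        r_long += 0.02 * (reward - r_long)
        beta = max(0.1, beta + eta * (r_short - r_long))

    return total / n_steps


print(run_hybrid_agent([1.0, 0.5, 0.0]))
```

One design choice worth noting in this sketch: the exploration bonus is tied to the Kalman posterior standard deviation, so arms that have not been pulled for a long time accumulate process variance and automatically become attractive again, which mirrors the "explore actions that have not been tested for a long time" principle mentioned in the abstract.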