Bio-inspired meta-learning for active exploration during non-stationary multi-armed bandit tasks

G. Velentzas, C. Tzafestas, M. Khamassi
{"title":"Bio-inspired meta-learning for active exploration during non-stationary multi-armed bandit tasks","authors":"G. Velentzas, C. Tzafestas, M. Khamassi","doi":"10.1109/INTELLISYS.2017.8324365","DOIUrl":null,"url":null,"abstract":"Fast adaptation to changes in the environment requires agents (animals, robots and simulated artefacts) to be able to dynamically tune an exploration-exploitation trade-off during learning. This trade-off usually determines a fixed proportion of exploitative choices (i.e. choice of the action that subjectively appears as best at a given moment) relative to exploratory choices (i.e. testing other actions that now appear worst but may turn out promising later). Rather than using a fixed proportion, non-stationary multi-armed bandit methods in the field of machine learning have proven that principles such as exploring actions that have not been tested for a long time can lead to performance closer to optimal — bounded regret. In parallel, researches in active exploration in the fields of robot learning and computational neuroscience of learning and decision-making have proposed alternative solutions such as transiently increasing exploration in response to drops in average performance, or attributing exploration bonuses specifically to actions associated with high uncertainty in order to gain information when choosing them. In this work, we compare different methods from machine learning, computational neuroscience and robot learning on a set of non-stationary stochastic multi-armed bandit tasks: abrupt shifts; best bandit becomes worst one and vice versa; multiple shifting frequencies. We find that different methods are appropriate in different scenarios. We propose a new hybrid method combining bio-inspired meta-learning, kalman filter and exploration bonuses and show that it outperforms other methods in these scenarios.","PeriodicalId":131825,"journal":{"name":"2017 Intelligent Systems Conference (IntelliSys)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 Intelligent Systems Conference (IntelliSys)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/INTELLISYS.2017.8324365","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

Fast adaptation to changes in the environment requires agents (animals, robots and simulated artefacts) to be able to dynamically tune an exploration-exploitation trade-off during learning. This trade-off usually determines a fixed proportion of exploitative choices (i.e. choosing the action that subjectively appears best at a given moment) relative to exploratory choices (i.e. testing other actions that currently appear worse but may turn out promising later). Rather than using a fixed proportion, non-stationary multi-armed bandit methods in the field of machine learning have proven that principles such as exploring actions that have not been tested for a long time can lead to performance closer to optimal, with bounded regret. In parallel, research on active exploration in the fields of robot learning and computational neuroscience of learning and decision-making has proposed alternative solutions, such as transiently increasing exploration in response to drops in average performance, or attributing exploration bonuses specifically to actions associated with high uncertainty in order to gain information when choosing them. In this work, we compare different methods from machine learning, computational neuroscience and robot learning on a set of non-stationary stochastic multi-armed bandit tasks: abrupt shifts; the best bandit becoming the worst one and vice versa; multiple shifting frequencies. We find that different methods are appropriate in different scenarios. We propose a new hybrid method combining bio-inspired meta-learning, a Kalman filter and exploration bonuses, and show that it outperforms other methods in these scenarios.
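The abstract names the ingredients of the hybrid method (bio-inspired meta-learning, a Kalman filter and exploration bonuses) without giving the update rules, so the following Python sketch is an illustration under our own assumptions rather than the authors' algorithm: each arm's drifting mean reward is tracked by a scalar Kalman filter whose posterior variance supplies an uncertainty bonus, and the softmax inverse temperature is meta-learned by comparing fast and slow running averages of reward, so that a drop in performance transiently boosts exploration. All parameter names and values (obs_var, process_var, bonus_weight, eta) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)


class KalmanArm:
    """Scalar Kalman filter tracking one arm's (possibly drifting) mean reward."""

    def __init__(self, obs_var=0.1, process_var=0.01):
        self.mean = 0.0                  # posterior mean of the arm's value
        self.var = 1.0                   # posterior variance (uncertainty)
        self.obs_var = obs_var           # assumed reward observation noise
        self.process_var = process_var   # drift noise: uncertainty grows over time

    def predict(self):
        # Non-stationarity: variance grows on every step, even for unpulled arms.
        self.var += self.process_var

    def update(self, reward):
        # Standard scalar Kalman gain and posterior update.
        gain = self.var / (self.var + self.obs_var)
        self.mean += gain * (reward - self.mean)
        self.var *= (1.0 - gain)


def softmax(x, beta):
    z = beta * (x - np.max(x))
    p = np.exp(z)
    return p / p.sum()


def run_hybrid_agent(arm_means, n_steps=2000, swap_every=500,
                     bonus_weight=1.0, beta0=2.0, eta=0.1):
    """Kalman value tracking + uncertainty bonus + meta-learned softmax temperature."""
    n_arms = len(arm_means)
    arms = [KalmanArm() for _ in range(n_arms)]
    beta = beta0                         # inverse temperature (higher = more exploitative)
    r_short, r_long = 0.0, 0.0           # fast and slow running averages of reward
    means = np.array(arm_means, dtype=float)
    total = 0.0

    for t in range(n_steps):
        if t > 0 and t % swap_every == 0:
            means = means[::-1].copy()   # abrupt shift: best arm becomes worst and vice versa

        for arm in arms:
            arm.predict()
        values = np.array([a.mean + bonus_weight * np.sqrt(a.var) for a in arms])
        choice = rng.choice(n_arms, p=softmax(values, beta))
        reward = rng.normal(means[choice], 0.3)
        arms[choice].update(reward)
        total += reward

        # Meta-learning of beta: when the fast reward average falls below the slow one
        # (a performance drop), beta decreases, which transiently increases exploration.
        r_short += 0.2 * (reward - r_short)
        r_long += 0.02 * (reward - r_long)
        beta = max(0.1, beta + eta * (r_short - r_long))

    return total / n_steps


print(run_hybrid_agent([1.0, 0.5, 0.0]))
```

One design choice worth noting in this sketch: the exploration bonus is tied to the Kalman posterior standard deviation, so arms that have not been pulled for a long time accumulate process variance and automatically become attractive again, which mirrors the "explore actions that have not been tested for a long time" principle mentioned in the abstract.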