Transfer learning for direct policy search: A reward shaping approach
S. Doncieux
DOI: 10.1109/DEVLRN.2013.6652568
2013 IEEE Third Joint International Conference on Development and Learning and Epigenetic Robotics (ICDL)
Published: 2013-11-04
Citations: 13
Abstract
From the perspective of lifelong learning, a robot may face different but related situations. Being able to exploit the knowledge acquired during a first learning phase may be critical for solving more complex tasks. This is the transfer learning problem. It is addressed here in the case of direct policy search algorithms. No discrete states or actions are defined a priori: a policy is described by a controller that computes motor commands from sensor values, and both motor and sensor values can be continuous. The proposed approach relies on population-based direct policy search algorithms, i.e. evolutionary algorithms, and exploits the numerous behaviors generated during the search. While learning on the source task, a knowledge base is built that aims at identifying the behavior segments most salient with regard to the considered task. Afterwards, the knowledge base is exploited on a target task with a reward shaping approach: besides its reward on the task, a policy is credited with a reward computed from the knowledge base. The rationale behind this approach is to automatically detect the stepping stones, i.e. the behavior segments that led to a reward in the source task, before the policy is efficient enough to obtain the reward on the target task. The approach is tested in simulation with a neuroevolution approach on ball collecting tasks.
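The shaping scheme described in the abstract can be sketched as follows: during evolutionary search on the target task, a policy's fitness is its task reward plus a bonus for producing behavior segments resembling those the knowledge base marked as salient on the source task. This is a minimal illustrative sketch, not the paper's implementation; the function names, the segment length, the similarity measure, and the shaping weight are all assumptions.

```python
# Hedged sketch of knowledge-base reward shaping for transfer in
# direct policy search. All names and measures are illustrative.

def segment_similarity(seg_a, seg_b):
    """Toy similarity in (0, 1]: 1 / (1 + summed absolute difference)."""
    diff = sum(abs(a - b) for a, b in zip(seg_a, seg_b))
    return 1.0 / (1.0 + diff)

def kb_bonus(behavior, knowledge_base, seg_len=3):
    """Average, over the behavior's segments, of the best match
    against the salient source-task segments in the knowledge base."""
    segments = [behavior[i:i + seg_len]
                for i in range(0, len(behavior) - seg_len + 1, seg_len)]
    if not segments:
        return 0.0
    bonus = sum(max(segment_similarity(seg, s) for s in knowledge_base)
                for seg in segments)
    return bonus / len(segments)

def shaped_fitness(task_reward, behavior, knowledge_base, weight=0.5):
    """Fitness used by the evolutionary algorithm on the target task:
    the task reward plus a weighted knowledge-base shaping term."""
    return task_reward + weight * kb_bonus(behavior, knowledge_base)
```

With this shaping term, a policy that earns no task reward yet, but whose behavior reproduces source-task stepping stones, still receives positive fitness and can survive selection until it becomes efficient enough to obtain the target-task reward itself.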