Teaching AI Agents Ethical Values Using Reinforcement Learning and Policy Orchestration

Ritesh Noothigattu, Djallel Bouneffouf, Nicholas Mattei, Rachita Chandra, Piyush Madan, Kush R. Varshney, Murray Campbell, Moninder Singh, F. Rossi
{"title":"使用强化学习和策略编排来教授人工智能代理的道德价值观","authors":"Ritesh Noothigattu, Djallel Bouneffouf, Nicholas Mattei, Rachita Chandra, Piyush Madan, Kush R. Varshney, Murray Campbell, Moninder Singh, F. Rossi","doi":"10.24963/ijcai.2019/891","DOIUrl":null,"url":null,"abstract":"Autonomous cyber-physical agents play an increasingly large role in our lives. To ensure that they behave in ways aligned with the values of society, we must develop techniques that allow these agents to not only maximize their reward in an environment, but also to learn and follow the implicit constraints of society. \nWe detail a novel approach that uses inverse reinforcement learning to learn a set of unspecified constraints from demonstrations and reinforcement learning to learn to maximize environmental rewards. A contextual bandit-based orchestrator then picks between the two policies: constraint-based and environment reward-based. The contextual bandit orchestrator allows the agent to mix policies in novel ways, taking the best actions from either a reward-maximizing or constrained policy. In addition, the orchestrator is transparent on which policy is being employed at each time step. We test our algorithms using Pac-Man and show that the agent is able to learn to act optimally, act within the demonstrated constraints, and mix these two functions in complex ways.","PeriodicalId":13051,"journal":{"name":"IBM J. Res. Dev.","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"56","resultStr":"{\"title\":\"Teaching AI Agents Ethical Values Using Reinforcement Learning and Policy Orchestration\",\"authors\":\"Ritesh Noothigattu, Djallel Bouneffouf, Nicholas Mattei, Rachita Chandra, Piyush Madan, Kush R. Varshney, Murray Campbell, Moninder Singh, F. Rossi\",\"doi\":\"10.24963/ijcai.2019/891\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Autonomous cyber-physical agents play an increasingly large role in our lives. To ensure that they behave in ways aligned with the values of society, we must develop techniques that allow these agents to not only maximize their reward in an environment, but also to learn and follow the implicit constraints of society. \\nWe detail a novel approach that uses inverse reinforcement learning to learn a set of unspecified constraints from demonstrations and reinforcement learning to learn to maximize environmental rewards. A contextual bandit-based orchestrator then picks between the two policies: constraint-based and environment reward-based. The contextual bandit orchestrator allows the agent to mix policies in novel ways, taking the best actions from either a reward-maximizing or constrained policy. In addition, the orchestrator is transparent on which policy is being employed at each time step. We test our algorithms using Pac-Man and show that the agent is able to learn to act optimally, act within the demonstrated constraints, and mix these two functions in complex ways.\",\"PeriodicalId\":13051,\"journal\":{\"name\":\"IBM J. Res. Dev.\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"56\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IBM J. Res. 
Dev.\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.24963/ijcai.2019/891\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IBM J. Res. Dev.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.24963/ijcai.2019/891","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 56

Abstract

Autonomous cyber-physical agents play an increasingly large role in our lives. To ensure that they behave in ways aligned with the values of society, we must develop techniques that allow these agents to not only maximize their reward in an environment, but also to learn and follow the implicit constraints of society. We detail a novel approach that uses inverse reinforcement learning to learn a set of unspecified constraints from demonstrations and reinforcement learning to learn to maximize environmental rewards. A contextual bandit-based orchestrator then picks between the two policies: constraint-based and environment reward-based. The contextual bandit orchestrator allows the agent to mix policies in novel ways, taking the best actions from either a reward-maximizing or constrained policy. In addition, the orchestrator is transparent on which policy is being employed at each time step. We test our algorithms using Pac-Man and show that the agent is able to learn to act optimally, act within the demonstrated constraints, and mix these two functions in complex ways.
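The architecture the abstract describes is straightforward to picture in code: two pre-trained policies (one maximizing environment reward, one imitating demonstrated constraints) and a contextual bandit that, at each time step, decides which policy gets to act. The sketch below is a minimal illustration of that orchestration idea, assuming an epsilon-greedy linear bandit; the class name, feature scheme, and payoff signal are hypothetical stand-ins, not the authors' implementation, whose exact orchestrator and reward blending are specified in the full paper.

import numpy as np

class PolicyOrchestrator:
    """Epsilon-greedy contextual bandit over two arms:
    arm 0 = reward-maximizing RL policy, arm 1 = constraint-following
    policy learned from demonstrations. A linear model per arm estimates
    the expected payoff of delegating the current step to that policy."""

    def __init__(self, n_features, n_arms=2, epsilon=0.1, lr=0.01):
        self.epsilon = epsilon
        self.lr = lr
        self.weights = np.zeros((n_arms, n_features))

    def choose_arm(self, context):
        """Pick which policy acts at this step; the choice is observable,
        which is the transparency property the abstract mentions."""
        if np.random.rand() < self.epsilon:            # explore
            return np.random.randint(len(self.weights))
        return int(np.argmax(self.weights @ context))  # exploit

    def update(self, arm, context, payoff):
        """One SGD step on the squared error of the chosen arm's estimate."""
        error = payoff - self.weights[arm] @ context
        self.weights[arm] += self.lr * error * context

# Toy demo with synthetic contexts: arm i pays off when feature i is large,
# so the bandit learns to route each context to the better "policy".
rng = np.random.default_rng(0)
orch = PolicyOrchestrator(n_features=2)
for _ in range(2000):
    ctx = rng.random(2)
    arm = orch.choose_arm(ctx)
    payoff = ctx[arm] + 0.05 * rng.standard_normal()
    orch.update(arm, ctx, payoff)
print(orch.weights)  # each arm's weights track its own payoff structure

In the paper's setting, the context would be features of the Pac-Man game state and the payoff would blend environment reward with a constraint-violation signal; both of those design choices are left to the full text.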