Safe Linear Bandits

Ahmadreza Moradipari, Sanae Amani, M. Alizadeh, Christos Thrampoulidis
{"title":"安全线性土匪","authors":"Ahmadreza Moradipari, Sanae Amani, M. Alizadeh, Christos Thrampoulidis","doi":"10.1109/CISS50987.2021.9400288","DOIUrl":null,"url":null,"abstract":"Bandit algorithms have various applications in safety-critical systems, where it is important to respect the system's underlying constraints. The challenge is that such constraints are often unknown as they depend on the bandit's unknown parameters. In this talk, we formulate a linear stochastic multi-armed bandit problem with safety constraints that depend linearly on an unknown parameter vector. As such, the learner is unable to identify all safe actions and must act conservatively in ensuring that their actions satisfy the safety constraint at all rounds (at least with high probability). For these bandits, we propose new upper-confidence bound (UCB) and Thompson-sampling algorithms, which include necessary modifications to respect the safety constraints. For two settings -with and without bandit feedback information on the constraint- we prove regret bounds and discuss their optimality in relation to corresponding bounds in the absence of safety restrictions. For example, for a setting with bandit-feedback information on the constraint, we present a frequentist regret of order $\\mathcal{O}\\left(d^{3/2}log^{1/2}d\\sqrt{T}log^{2/3}T\\right)$, which remarkably matches the results provided by [1] for the standard linear Thompson-sampling algorithm. We highlight how the inherently randomized nature of Thompson-sampling helps expand the set of safe actions the algorithm has access to at each round. Finally, we discuss related problem variations with stage-wise baseline constraints, in which the learner must choose actions that not only maximize cumulative reward across the entire time horizon, but they further satisfy a linear baseline constraint taking the form of a lower bound on the instantaneous reward. The content of this talk is based on [2]–[4].","PeriodicalId":228112,"journal":{"name":"2021 55th Annual Conference on Information Sciences and Systems (CISS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"SAFE LINEAR BANDITS\",\"authors\":\"Ahmadreza Moradipari, Sanae Amani, M. Alizadeh, Christos Thrampoulidis\",\"doi\":\"10.1109/CISS50987.2021.9400288\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Bandit algorithms have various applications in safety-critical systems, where it is important to respect the system's underlying constraints. The challenge is that such constraints are often unknown as they depend on the bandit's unknown parameters. In this talk, we formulate a linear stochastic multi-armed bandit problem with safety constraints that depend linearly on an unknown parameter vector. As such, the learner is unable to identify all safe actions and must act conservatively in ensuring that their actions satisfy the safety constraint at all rounds (at least with high probability). For these bandits, we propose new upper-confidence bound (UCB) and Thompson-sampling algorithms, which include necessary modifications to respect the safety constraints. For two settings -with and without bandit feedback information on the constraint- we prove regret bounds and discuss their optimality in relation to corresponding bounds in the absence of safety restrictions. 
For example, for a setting with bandit-feedback information on the constraint, we present a frequentist regret of order $\\\\mathcal{O}\\\\left(d^{3/2}log^{1/2}d\\\\sqrt{T}log^{2/3}T\\\\right)$, which remarkably matches the results provided by [1] for the standard linear Thompson-sampling algorithm. We highlight how the inherently randomized nature of Thompson-sampling helps expand the set of safe actions the algorithm has access to at each round. Finally, we discuss related problem variations with stage-wise baseline constraints, in which the learner must choose actions that not only maximize cumulative reward across the entire time horizon, but they further satisfy a linear baseline constraint taking the form of a lower bound on the instantaneous reward. The content of this talk is based on [2]–[4].\",\"PeriodicalId\":228112,\"journal\":{\"name\":\"2021 55th Annual Conference on Information Sciences and Systems (CISS)\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-03-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 55th Annual Conference on Information Sciences and Systems (CISS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CISS50987.2021.9400288\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 55th Annual Conference on Information Sciences and Systems (CISS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CISS50987.2021.9400288","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Bandit algorithms have various applications in safety-critical systems, where it is important to respect the system's underlying constraints. The challenge is that such constraints are often unknown, as they depend on the bandit's unknown parameters. In this talk, we formulate a linear stochastic multi-armed bandit problem with safety constraints that depend linearly on an unknown parameter vector. As such, the learner is unable to identify all safe actions and must act conservatively to ensure that its actions satisfy the safety constraint at all rounds (at least with high probability). For these bandits, we propose new upper-confidence bound (UCB) and Thompson-sampling algorithms, which include the modifications necessary to respect the safety constraints. For two settings, with and without bandit feedback on the constraint, we prove regret bounds and discuss their optimality relative to the corresponding bounds in the absence of safety restrictions. For example, for the setting with bandit feedback on the constraint, we present a frequentist regret of order $\mathcal{O}\left(d^{3/2}\log^{1/2}d\,\sqrt{T}\,\log^{2/3}T\right)$, which remarkably matches the result provided by [1] for the standard linear Thompson-sampling algorithm. We highlight how the inherently randomized nature of Thompson sampling helps expand the set of safe actions the algorithm has access to at each round. Finally, we discuss related problem variations with stage-wise baseline constraints, in which the learner must choose actions that not only maximize cumulative reward across the entire time horizon but also satisfy a linear baseline constraint taking the form of a lower bound on the instantaneous reward. The content of this talk is based on [2]–[4].
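
To make the mechanism concrete, below is a minimal sketch of the kind of safe Thompson-sampling loop the abstract describes, for the setting with bandit feedback on the constraint. It is an illustration under simplifying assumptions rather than the paper's algorithm verbatim: Gaussian noise, a finite action set, a fixed confidence radius beta in place of the paper's exact constants, an unscaled posterior perturbation, and a zero action assumed safe a priori are all choices made here for brevity.

```python
import numpy as np

# Minimal sketch: safe linear Thompson sampling on a finite action set.
# theta_star drives the reward, mu_star the safety constraint
# mu_star @ x <= c; both are unknown to the learner.
rng = np.random.default_rng(0)
d, T, c, lam = 3, 200, 0.5, 1.0

theta_star = rng.normal(size=d); theta_star /= np.linalg.norm(theta_star)
mu_star = rng.normal(size=d);    mu_star /= np.linalg.norm(mu_star)

actions = rng.normal(size=(50, d))
actions /= np.linalg.norm(actions, axis=1, keepdims=True)
actions = np.vstack([np.zeros(d), actions])  # zero action: safe by construction

V = lam * np.eye(d)     # shared regularized Gram matrix
b_reward = np.zeros(d)  # accumulated x * reward observations
b_safety = np.zeros(d)  # accumulated x * constraint observations (bandit feedback)
beta = 1.0              # illustrative confidence radius, NOT the paper's constant

for t in range(T):
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b_reward  # RLS estimate of the reward parameter
    mu_hat = V_inv @ b_safety     # RLS estimate of the constraint parameter

    # Thompson step: perturb the reward estimate (unscaled posterior here;
    # the paper inflates this perturbation to get safe exploration guarantees).
    theta_tilde = rng.multivariate_normal(theta_hat, V_inv)

    # Conservative safe set: the worst-case constraint value over a
    # confidence ellipsoid around mu_hat must stay below the threshold c.
    ellip = np.sqrt(np.einsum('ij,jk,ik->i', actions, V_inv, actions))
    safe = actions @ mu_hat + beta * ellip <= c

    # Maximize the sampled reward over the estimated safe set.
    idx = np.flatnonzero(safe)  # never empty: the zero action always qualifies
    x = actions[idx[np.argmax(actions[idx] @ theta_tilde)]]

    # Bandit feedback: noisy reward and noisy constraint measurement.
    V += np.outer(x, x)
    b_reward += (theta_star @ x + 0.1 * rng.normal()) * x
    b_safety += (mu_star @ x + 0.1 * rng.normal()) * x

print("reward-parameter estimation error:", np.linalg.norm(theta_hat - theta_star))
```

The randomized perturbation theta_tilde is what occasionally pushes the learner toward actions near the boundary of the estimated safe set; as those actions are played, the confidence ellipsoid around mu_hat shrinks and the safe set grows, which is the expansion effect the talk highlights.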