Analysis of Rewards in Bernoulli Bandits Using Martingales

C. Leung, Longjun Hao
{"title":"利用鞅分析伯努利盗匪的报酬","authors":"C. Leung, Longjun Hao","doi":"10.1109/AIKE48582.2020.00015","DOIUrl":null,"url":null,"abstract":"Bernoulli bandits have found to mirror many practical situations in the context of reinforcement learning, and the aim is to maximize rewards through playing the machine over a set time frame. In an actual casino setting, it is often unrealistic to fix the time when playing stops, as the termination of play may be random and dependent on the outcomes of earlier lever pulls, which in turn affects the inclination of the gambler to continue playing. It is often assumed that exploration is repeated each time the game is played, and that the game tend to go on indefinitely. In practical situations, if the casino does not change their machines often, exploration need not be carried out repeatedly as this would be inefficient. Moreover, from the gamblers' point of view, they would likely to stop at some point or when certain conditions are fulfilled. Here, the bandit problem is studied in terms of stopping rules which are dependent on earlier random outcomes and on the behavior of the players. Rewards incorporating the cost of play and the size of payouts are then calculated on the conclusion of a playing episode. Here, the rewards for Bernoulli machines are placed within the context of martingales that are commonly used in gambling situations, and the fairness of the game is expressed through the parameters of the bandit machines which can be manifested as various forms of martingales. The average rewards and regrets as well as episode durations are obtained under different martingale stopping times. Exploration costs and regrets for different bandit machines are analyzed. Experimentation has also been undertaken which corroborate the theoretical results.","PeriodicalId":370671,"journal":{"name":"2020 IEEE Third International Conference on Artificial Intelligence and Knowledge Engineering (AIKE)","volume":"87 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Analysis of Rewards in Bernoulli Bandits Using Martingales\",\"authors\":\"C. Leung, Longjun Hao\",\"doi\":\"10.1109/AIKE48582.2020.00015\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Bernoulli bandits have found to mirror many practical situations in the context of reinforcement learning, and the aim is to maximize rewards through playing the machine over a set time frame. In an actual casino setting, it is often unrealistic to fix the time when playing stops, as the termination of play may be random and dependent on the outcomes of earlier lever pulls, which in turn affects the inclination of the gambler to continue playing. It is often assumed that exploration is repeated each time the game is played, and that the game tend to go on indefinitely. In practical situations, if the casino does not change their machines often, exploration need not be carried out repeatedly as this would be inefficient. Moreover, from the gamblers' point of view, they would likely to stop at some point or when certain conditions are fulfilled. Here, the bandit problem is studied in terms of stopping rules which are dependent on earlier random outcomes and on the behavior of the players. Rewards incorporating the cost of play and the size of payouts are then calculated on the conclusion of a playing episode. 
Here, the rewards for Bernoulli machines are placed within the context of martingales that are commonly used in gambling situations, and the fairness of the game is expressed through the parameters of the bandit machines which can be manifested as various forms of martingales. The average rewards and regrets as well as episode durations are obtained under different martingale stopping times. Exploration costs and regrets for different bandit machines are analyzed. Experimentation has also been undertaken which corroborate the theoretical results.\",\"PeriodicalId\":370671,\"journal\":{\"name\":\"2020 IEEE Third International Conference on Artificial Intelligence and Knowledge Engineering (AIKE)\",\"volume\":\"87 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 IEEE Third International Conference on Artificial Intelligence and Knowledge Engineering (AIKE)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/AIKE48582.2020.00015\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE Third International Conference on Artificial Intelligence and Knowledge Engineering (AIKE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/AIKE48582.2020.00015","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Bernoulli bandits have been found to mirror many practical situations in the context of reinforcement learning, where the aim is to maximize rewards through playing the machine over a set time frame. In an actual casino setting, it is often unrealistic to fix the time at which playing stops, as the termination of play may be random and dependent on the outcomes of earlier lever pulls, which in turn affect the gambler's inclination to continue playing. It is often assumed that exploration is repeated each time the game is played, and that the game tends to go on indefinitely. In practice, if the casino does not change its machines often, exploration need not be carried out repeatedly, as doing so would be inefficient. Moreover, from the gamblers' point of view, they would likely stop at some point, or when certain conditions are fulfilled. Here, the bandit problem is studied in terms of stopping rules that depend on earlier random outcomes and on the behavior of the players. Rewards incorporating the cost of play and the size of payouts are then calculated at the conclusion of a playing episode. The rewards for Bernoulli machines are placed within the context of martingales, which are commonly used in gambling situations, and the fairness of the game is expressed through the parameters of the bandit machines, which can be manifested as various forms of martingales. The average rewards and regrets, as well as episode durations, are obtained under different martingale stopping times. Exploration costs and regrets for different bandit machines are analyzed. Experiments have also been undertaken which corroborate the theoretical results.
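To make the martingale framing concrete, the following is a minimal sketch of the standard construction the abstract alludes to. The notation (win probability p, payout a per winning pull, cost c per pull) is illustrative and not taken from the paper itself.

```latex
% Net reward after n pulls of a single Bernoulli machine:
% X_i ~ Bernoulli(p) i.i.d., payout a per win, cost c per pull.
S_n = \sum_{i=1}^{n} \left( a X_i - c \right), \qquad
\mathbb{E}\!\left[ S_{n+1} \mid \mathcal{F}_n \right] = S_n + (ap - c).

% Hence {S_n} is a martingale iff ap = c (a "fair" machine),
% a submartingale if ap > c, and a supermartingale if ap < c.
% For a fair machine and a stopping time \tau satisfying the
% conditions of the optional stopping theorem (e.g. bounded):
\mathbb{E}[S_\tau] = \mathbb{E}[S_0] = 0.

% In the biased cases, Wald's identity links expected reward
% to expected episode duration:
\mathbb{E}[S_\tau] = (ap - c)\,\mathbb{E}[\tau].
```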
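Below is a small simulation sketch of episode rewards under an outcome-dependent stopping rule. The particular rule used here (stop when the net reward first reaches +win_goal or -loss_limit, or after a maximum number of pulls) is one plausible instance chosen for illustration; the stopping rules analyzed in the paper may differ, and all names and parameters are hypothetical.

```python
import random

def play_episode(p, payout, cost, win_goal, loss_limit,
                 max_pulls=10_000, rng=random):
    """Simulate one episode on a single Bernoulli machine.

    Each pull costs `cost` and pays `payout` with probability `p`.
    Play stops when the net reward first reaches +win_goal,
    falls to -loss_limit, or max_pulls is exhausted.
    Returns (net_reward, number_of_pulls).
    """
    net = 0.0
    for t in range(1, max_pulls + 1):
        net += (payout if rng.random() < p else 0.0) - cost
        if net >= win_goal or net <= -loss_limit:
            return net, t
    return net, max_pulls

def average_over_episodes(n_episodes=100_000, **kwargs):
    """Monte Carlo estimates of mean reward and mean episode duration."""
    total_reward = total_pulls = 0.0
    for _ in range(n_episodes):
        r, t = play_episode(**kwargs)
        total_reward += r
        total_pulls += t
    return total_reward / n_episodes, total_pulls / n_episodes

if __name__ == "__main__":
    # A fair machine (p * payout == cost): the mean reward should be
    # near 0 by the optional stopping theorem.
    fair = average_over_episodes(p=0.50, payout=2.0, cost=1.0,
                                 win_goal=10.0, loss_limit=10.0)
    # An unfavourable machine (p * payout < cost): the mean reward is
    # negative, roughly (p*payout - cost) * E[tau] by Wald's identity.
    house = average_over_episodes(p=0.45, payout=2.0, cost=1.0,
                                  win_goal=10.0, loss_limit=10.0)
    print(f"fair machine:  mean reward {fair[0]:+.3f}, mean pulls {fair[1]:.1f}")
    print(f"house machine: mean reward {house[0]:+.3f}, mean pulls {house[1]:.1f}")
```

Comparing the two printed lines shows how the machine's parameters, rather than the stopping rule alone, determine whether the stopped reward process behaves as a fair martingale or drifts toward the house.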