{"title":"适应多臂匪徒中转换成本的简单修复方法","authors":"","doi":"10.1016/j.ejor.2024.09.017","DOIUrl":null,"url":null,"abstract":"<div><div>When switching costs are added to the multi-armed bandit (MAB) problem where the arms’ random reward distributions are previously unknown, usually quite different techniques than those for pure MAB are required. We find that two simple fixes on the existing upper-confidence-bound (UCB) policy can work well for MAB with switching costs (MAB-SC). Two cases should be distinguished. One is with <em>positive-gap</em> ambiguity where the performance gap between the leading and lagging arms is known to be at least some <span><math><mrow><mi>δ</mi><mo>></mo><mn>0</mn></mrow></math></span>. For this, our fix is to erect barriers that discourage frivolous arm switchings. The other is with <em>zero-gap</em> ambiguity where absolutely nothing is known. We remedy this by forcing the same arms to be pulled in increasingly prolonged intervals. As usual, the effectivenesses of our fixes are measured by the worst average regrets over long time horizons <span><math><mi>T</mi></math></span>. When the barriers are fixed at <span><math><mrow><mi>δ</mi><mo>/</mo><mn>2</mn></mrow></math></span>, we can accomplish a <span><math><mrow><mo>ln</mo><mrow><mo>(</mo><mi>T</mi><mo>)</mo></mrow></mrow></math></span>-sized regret bound for the positive-gap case. When intervals are such that <span><math><mi>n</mi></math></span> of them occupy <span><math><msup><mrow><mi>n</mi></mrow><mrow><mn>2</mn></mrow></msup></math></span> periods, we can achieve the best possible <span><math><msup><mrow><mi>T</mi></mrow><mrow><mn>1</mn><mo>/</mo><mn>2</mn></mrow></msup></math></span>-sized regret bound for the zero-gap case. Other than UCB, these fixes can be applied to a learning while doing (LWD) heuristic to reach satisfactory results as well. While not yet with the best theoretical guarantees, the LWD-based policies have empirically outperformed those based on UCB and other known alternatives. Numerically competitive policies still include ones resulting from interval-based fixes on Thompson sampling (TS).</div></div>","PeriodicalId":55161,"journal":{"name":"European Journal of Operational Research","volume":null,"pages":null},"PeriodicalIF":6.0000,"publicationDate":"2024-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Simple fixes that accommodate switching costs in multi-armed bandits\",\"authors\":\"\",\"doi\":\"10.1016/j.ejor.2024.09.017\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>When switching costs are added to the multi-armed bandit (MAB) problem where the arms’ random reward distributions are previously unknown, usually quite different techniques than those for pure MAB are required. We find that two simple fixes on the existing upper-confidence-bound (UCB) policy can work well for MAB with switching costs (MAB-SC). Two cases should be distinguished. One is with <em>positive-gap</em> ambiguity where the performance gap between the leading and lagging arms is known to be at least some <span><math><mrow><mi>δ</mi><mo>></mo><mn>0</mn></mrow></math></span>. For this, our fix is to erect barriers that discourage frivolous arm switchings. The other is with <em>zero-gap</em> ambiguity where absolutely nothing is known. We remedy this by forcing the same arms to be pulled in increasingly prolonged intervals. 
As usual, the effectivenesses of our fixes are measured by the worst average regrets over long time horizons <span><math><mi>T</mi></math></span>. When the barriers are fixed at <span><math><mrow><mi>δ</mi><mo>/</mo><mn>2</mn></mrow></math></span>, we can accomplish a <span><math><mrow><mo>ln</mo><mrow><mo>(</mo><mi>T</mi><mo>)</mo></mrow></mrow></math></span>-sized regret bound for the positive-gap case. When intervals are such that <span><math><mi>n</mi></math></span> of them occupy <span><math><msup><mrow><mi>n</mi></mrow><mrow><mn>2</mn></mrow></msup></math></span> periods, we can achieve the best possible <span><math><msup><mrow><mi>T</mi></mrow><mrow><mn>1</mn><mo>/</mo><mn>2</mn></mrow></msup></math></span>-sized regret bound for the zero-gap case. Other than UCB, these fixes can be applied to a learning while doing (LWD) heuristic to reach satisfactory results as well. While not yet with the best theoretical guarantees, the LWD-based policies have empirically outperformed those based on UCB and other known alternatives. Numerically competitive policies still include ones resulting from interval-based fixes on Thompson sampling (TS).</div></div>\",\"PeriodicalId\":55161,\"journal\":{\"name\":\"European Journal of Operational Research\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":6.0000,\"publicationDate\":\"2024-09-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"European Journal of Operational Research\",\"FirstCategoryId\":\"91\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0377221724007203\",\"RegionNum\":2,\"RegionCategory\":\"管理学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"OPERATIONS RESEARCH & MANAGEMENT SCIENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"European Journal of Operational Research","FirstCategoryId":"91","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0377221724007203","RegionNum":2,"RegionCategory":"管理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"OPERATIONS RESEARCH & MANAGEMENT SCIENCE","Score":null,"Total":0}
Abstract
When switching costs are added to the multi-armed bandit (MAB) problem, in which the arms' random reward distributions are not known in advance, techniques quite different from those for the pure MAB problem are usually required. We find that two simple fixes to the existing upper-confidence-bound (UCB) policy can work well for MAB with switching costs (MAB-SC). Two cases should be distinguished. One is positive-gap ambiguity, where the performance gap between the leading and lagging arms is known to be at least some δ > 0. For this case, our fix is to erect barriers that discourage frivolous arm switches. The other is zero-gap ambiguity, where absolutely nothing is known. We remedy this by forcing the same arm to be pulled over increasingly prolonged intervals. As usual, the effectiveness of our fixes is measured by the worst-case average regret over long time horizons T. When the barriers are fixed at δ/2, we can accomplish a ln(T)-sized regret bound for the positive-gap case. When the intervals are such that n of them occupy n² periods, we can achieve the best possible T^(1/2)-sized regret bound for the zero-gap case. Beyond UCB, these fixes can also be applied to a learning-while-doing (LWD) heuristic with satisfactory results. While not yet backed by the best theoretical guarantees, the LWD-based policies have empirically outperformed those based on UCB and other known alternatives. Numerically competitive policies also include those resulting from interval-based fixes to Thompson sampling (TS).
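
The abstract describes the two fixes only verbally, so the following Python sketch illustrates one way they could be realized on top of a textbook UCB1 index with Bernoulli rewards. It is a minimal illustration under stated assumptions, not the paper's exact constructions: the function names (barrier_ucb, interval_ucb), the UCB1 bonus sqrt(2 ln t / pulls), the Bernoulli reward model, and the interval lengths 2j - 1 (chosen so that n intervals occupy exactly n² periods, matching the abstract) are all assumptions introduced here for illustration.

import math
import random

def ucb_index(mean, pulls, t):
    # Standard UCB1 index: empirical mean plus an exploration bonus.
    # (An assumed textbook form; the paper's exact index may differ.)
    return mean + math.sqrt(2.0 * math.log(t) / pulls)

def barrier_ucb(arm_means, horizon, delta, rng=None):
    # Positive-gap fix: keep pulling the incumbent arm unless a challenger's
    # UCB index beats the incumbent's index by more than a barrier of delta/2.
    rng = rng or random.Random(0)
    k = len(arm_means)
    pulls, means = [0] * k, [0.0] * k
    current, history = 0, []
    for t in range(1, horizon + 1):
        if t <= k:                                   # pull each arm once to initialize
            choice = t - 1
        else:
            idx = [ucb_index(means[i], pulls[i], t) for i in range(k)]
            best = max(range(k), key=lambda i: idx[i])
            switch = best != current and idx[best] > idx[current] + delta / 2.0
            choice = best if switch else current     # switch only past the barrier
        reward = 1.0 if rng.random() < arm_means[choice] else 0.0   # Bernoulli rewards (illustrative)
        pulls[choice] += 1
        means[choice] += (reward - means[choice]) / pulls[choice]
        current = choice
        history.append(choice)
    return history

def interval_ucb(arm_means, horizon, rng=None):
    # Zero-gap fix: commit to a single arm per interval; the j-th interval lasts
    # 2*j - 1 periods, so the first n intervals occupy exactly n**2 periods.
    rng = rng or random.Random(0)
    k = len(arm_means)
    pulls, means = [0] * k, [0.0] * k
    history, t, j = [], 0, 0
    while t < horizon:
        j += 1
        length = 2 * j - 1
        if j <= k:                                   # first k intervals: one arm each, to initialize
            choice = j - 1
        else:
            idx = [ucb_index(means[i], pulls[i], t + 1) for i in range(k)]
            choice = max(range(k), key=lambda i: idx[i])
        for _ in range(min(length, horizon - t)):    # pull the committed arm for the whole interval
            reward = 1.0 if rng.random() < arm_means[choice] else 0.0
            pulls[choice] += 1
            means[choice] += (reward - means[choice]) / pulls[choice]
            history.append(choice)
            t += 1
    return history

For example, with two Bernoulli arms of means 0.5 and 0.6 (a gap of δ = 0.1), barrier_ucb([0.5, 0.6], 10000, 0.1) and interval_ucb([0.5, 0.6], 10000) each return the sequence of arms pulled, from which the number of switches and the regret can be tallied.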
About the journal:
The European Journal of Operational Research (EJOR) publishes high quality, original papers that contribute to the methodology of operational research (OR) and to the practice of decision making.