{"title":"携带昂贵探测器的多武装土匪","authors":"Eray Can Elumar;Cem Tekin;Osman Yağan","doi":"10.1109/TIT.2024.3506866","DOIUrl":null,"url":null,"abstract":"Multi-armed bandits is a sequential decision-making problem where an agent must choose between multiple actions to maximize its cumulative reward over time, while facing uncertainty about the rewards associated with each action. The challenge lies in balancing the exploration of potentially higher-rewarding actions with the exploitation of known high-reward actions. We consider a multi-armed bandit problem with probes, where before pulling an arm, the decision-maker is allowed to probe one of the K arms for a cost \n<inline-formula> <tex-math>$c\\geq 0$ </tex-math></inline-formula>\n to observe its reward. We introduce a new regret definition that is based on the expected reward of the optimal action. We develop UCBP, a novel algorithm that utilizes this strategy to achieve a gap-independent regret upper bound that scales with the number of rounds T as \n<inline-formula> <tex-math>$ O(\\sqrt {KT\\log T})$ </tex-math></inline-formula>\n, and an order optimal gap-dependent upper bound of \n<inline-formula> <tex-math>$ O(K\\log T)$ </tex-math></inline-formula>\n. As a baseline, we introduce UCB-naive-probe, a naive UCB-based approach which has a gap-independent regret upper bound of \n<inline-formula> <tex-math>$O(K\\sqrt {T\\log T})$ </tex-math></inline-formula>\n, and gap-dependent regret bound of \n<inline-formula> <tex-math>$O(K^{2}\\log T)$ </tex-math></inline-formula>\n; and TSP, the Thompson sampling version of UCBP. 
In empirical simulations, UCBP outperforms UCB-naive-probe, and performs similarly to TSP, verifying the utility of UCBP and TSP algorithms in practical settings.","PeriodicalId":13494,"journal":{"name":"IEEE Transactions on Information Theory","volume":"71 1","pages":"618-643"},"PeriodicalIF":2.2000,"publicationDate":"2024-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multi-Armed Bandits With Costly Probes\",\"authors\":\"Eray Can Elumar;Cem Tekin;Osman Yağan\",\"doi\":\"10.1109/TIT.2024.3506866\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Multi-armed bandits is a sequential decision-making problem where an agent must choose between multiple actions to maximize its cumulative reward over time, while facing uncertainty about the rewards associated with each action. The challenge lies in balancing the exploration of potentially higher-rewarding actions with the exploitation of known high-reward actions. We consider a multi-armed bandit problem with probes, where before pulling an arm, the decision-maker is allowed to probe one of the K arms for a cost \\n<inline-formula> <tex-math>$c\\\\geq 0$ </tex-math></inline-formula>\\n to observe its reward. We introduce a new regret definition that is based on the expected reward of the optimal action. We develop UCBP, a novel algorithm that utilizes this strategy to achieve a gap-independent regret upper bound that scales with the number of rounds T as \\n<inline-formula> <tex-math>$ O(\\\\sqrt {KT\\\\log T})$ </tex-math></inline-formula>\\n, and an order optimal gap-dependent upper bound of \\n<inline-formula> <tex-math>$ O(K\\\\log T)$ </tex-math></inline-formula>\\n. 
As a baseline, we introduce UCB-naive-probe, a naive UCB-based approach which has a gap-independent regret upper bound of \\n<inline-formula> <tex-math>$O(K\\\\sqrt {T\\\\log T})$ </tex-math></inline-formula>\\n, and gap-dependent regret bound of \\n<inline-formula> <tex-math>$O(K^{2}\\\\log T)$ </tex-math></inline-formula>\\n; and TSP, the Thompson sampling version of UCBP. In empirical simulations, UCBP outperforms UCB-naive-probe, and performs similarly to TSP, verifying the utility of UCBP and TSP algorithms in practical settings.\",\"PeriodicalId\":13494,\"journal\":{\"name\":\"IEEE Transactions on Information Theory\",\"volume\":\"71 1\",\"pages\":\"618-643\"},\"PeriodicalIF\":2.2000,\"publicationDate\":\"2024-11-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Information Theory\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10767721/\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Information Theory","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10767721/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
The multi-armed bandit problem is a sequential decision-making problem in which an agent must choose among multiple actions to maximize its cumulative reward over time while facing uncertainty about the reward of each action. The challenge lies in balancing exploration of potentially higher-rewarding actions against exploitation of actions already known to yield high rewards. We consider a multi-armed bandit problem with probes, where before pulling an arm the decision-maker is allowed to probe one of the $K$ arms at a cost $c \geq 0$ to observe its reward. We introduce a new regret definition based on the expected reward of the optimal action. We develop UCBP, a novel algorithm that exploits this probing strategy to achieve a gap-independent regret upper bound that scales with the number of rounds $T$ as $O(\sqrt{KT \log T})$, and an order-optimal gap-dependent upper bound of $O(K \log T)$. As a baseline, we introduce UCB-naive-probe, a naive UCB-based approach with a gap-independent regret upper bound of $O(K \sqrt{T \log T})$ and a gap-dependent regret bound of $O(K^{2} \log T)$, as well as TSP, the Thompson sampling version of UCBP. In empirical simulations, UCBP outperforms UCB-naive-probe and performs similarly to TSP, verifying the utility of the UCBP and TSP algorithms in practical settings.
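The abstract does not spell out the internals of UCBP, but the setting it describes can be illustrated with a toy simulation: a UCB1-style learner that, before each pull, may pay a cost $c$ to probe one arm and observe a sample of its reward. The probe rule below (probe the most uncertain arm whenever its confidence width exceeds $c$) is a hypothetical simplification for illustration, not the paper's actual UCBP algorithm.

```python
import math
import random

def ucb_probe_sim(means, c=0.05, T=2000, seed=0):
    """Toy simulation of a UCB-style bandit with costly probes.

    Before each pull the agent may pay cost ``c`` to probe one arm and
    observe a reward sample; it then pulls the arm with the highest
    empirical mean plus confidence bonus. Illustrative sketch only.
    """
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K          # observations per arm (pulls + probes)
    sums = [0.0] * K          # total observed reward per arm
    net_reward = 0.0          # collected reward minus probe costs

    def draw(i):              # Bernoulli reward for arm i
        return 1.0 if rng.random() < means[i] else 0.0

    def width(i, t):          # UCB1 confidence width; inf if never observed
        if counts[i] == 0:
            return float("inf")
        return math.sqrt(2.0 * math.log(t) / counts[i])

    def index(i, t):          # empirical mean + confidence bonus
        if counts[i] == 0:
            return float("inf")
        return sums[i] / counts[i] + width(i, t)

    for t in range(1, T + 1):
        # Probe step: pay c to sample the most uncertain arm when the
        # confidence width still exceeds the probe cost.
        u = max(range(K), key=lambda i: width(i, t))
        if width(u, t) > c:
            counts[u] += 1
            sums[u] += draw(u)    # probe observation is seen, not collected
            net_reward -= c

        # Pull step: choose the arm with the highest UCB index.
        best = max(range(K), key=lambda i: index(i, t))
        r = draw(best)            # pulled reward is collected
        counts[best] += 1
        sums[best] += r
        net_reward += r

    return net_reward
```

With two well-separated arms, probing speeds up identification of the better arm at the price of the accumulated probe costs, so the net reward sits below $T$ times the best mean but well above the uniform-play baseline.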
About the journal:
The IEEE Transactions on Information Theory is a journal that publishes theoretical and experimental papers concerned with the transmission, processing, and utilization of information. The boundaries of acceptable subject matter are intentionally not sharply delimited. Rather, it is hoped that as the focus of research activity changes, a flexible policy will permit this Transactions to follow suit. Current appropriate topics are best reflected by recent Tables of Contents; they are summarized in the titles of editorial areas that appear on the inside front cover.