Rudrabhotla Sri Prakash, N. Karamchandani, Sharayu Moharir
Published in: 2022 National Conference on Communications (NCC), 2022-05-24. DOI: 10.1109/NCC55593.2022.9806785 (https://doi.org/10.1109/NCC55593.2022.9806785)
Best Arm Identification in Sample-path Correlated Bandits
We consider the problem of best arm identification in the fixed-confidence setting for a variant of the multi-armed bandit problem. In our problem, each arm is associated with two attributes: a known deterministic cost and an unknown stochastic reward. In addition, it is known that arms with higher costs have higher rewards across every sample path. The net utility of each arm is defined as the difference between its expected reward and its cost. We consider two information models, namely full-information feedback and sequential bandit feedback. We derive a fundamental lower bound on the sample complexity of any policy and propose policies with provable performance guarantees that exploit the structure of our problem. We supplement our analytical results by comparing the performance of various candidate policies via synthetic and data-driven simulations.
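The setup described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not one of the paper's policies: the Bernoulli reward model, the shared uniform draw used to couple the arms, the names `draw_rewards` and `estimate_best_arm`, and the example costs and probabilities are all hypothetical. It uses a fixed number of full-information rounds and a plain empirical-mean estimate, so it carries no confidence guarantee.

```python
import random


def draw_rewards(probs, rng):
    """Draw one full-information sample: a reward for every arm.

    Assumption (illustrative only): a single shared uniform draw u drives all
    arms, and arm i pays 1 if u < probs[i]. If probs is non-decreasing in
    cost, a costlier arm never pays less than a cheaper arm on the same
    sample path.
    """
    u = rng.random()
    return [1.0 if u < p else 0.0 for p in probs]


def estimate_best_arm(costs, probs, num_rounds, seed=0):
    """Estimate net utilities (mean reward minus known cost) and pick the max.

    This is a naive fixed-sample estimate, not a fixed-confidence policy.
    """
    rng = random.Random(seed)
    totals = [0.0] * len(costs)
    for _ in range(num_rounds):
        rewards = draw_rewards(probs, rng)
        for i, r in enumerate(rewards):
            totals[i] += r
    # Net utility of arm i: empirical mean reward minus its deterministic cost.
    utilities = [totals[i] / num_rounds - costs[i] for i in range(len(costs))]
    best = max(range(len(costs)), key=lambda i: utilities[i])
    return best, utilities


# Hypothetical instance: arms sorted by increasing cost and reward probability.
example_costs = [0.1, 0.3, 0.6]
example_probs = [0.3, 0.7, 0.8]
best_arm, est_utilities = estimate_best_arm(example_costs, example_probs, 5000)
```

In this hypothetical instance the true net utilities are 0.2, 0.4, and 0.2, so the middle arm is the best one even though the costliest arm has the highest expected reward.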