{"title":"Online learning with side information","authors":"Xiao Xu, Sattar Vakili, Qing Zhao, A. Swami","doi":"10.1109/MILCOM.2017.8170860","DOIUrl":null,"url":null,"abstract":"An online learning problem with side information is considered. The problem is formulated as a graph structured stochastic Multi-Armed Bandit (MAB). Each node in the graph represents an arm in the bandit problem and an edge between two arms indicates closeness in their mean rewards. It is shown that such side information induces a Unit Interval Graph and several graph properties can be leveraged to achieve a sublinear regret in the number of arms while preserving the optimal logarithmic regret in time. A lower bound on regret is established and a hierarchical learning policy that is order optimal in terms of both the number of arms and the learning horizon is developed.","PeriodicalId":113767,"journal":{"name":"MILCOM 2017 - 2017 IEEE Military Communications Conference (MILCOM)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"MILCOM 2017 - 2017 IEEE Military Communications Conference (MILCOM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MILCOM.2017.8170860","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 6
Abstract
An online learning problem with side information is considered. The problem is formulated as a graph-structured stochastic Multi-Armed Bandit (MAB): each node in the graph represents an arm, and an edge between two arms indicates closeness in their mean rewards. It is shown that such side information induces a Unit Interval Graph, whose properties can be leveraged to achieve regret sublinear in the number of arms while preserving the optimal logarithmic regret in time. A lower bound on regret is established, and a hierarchical learning policy is developed that is order-optimal in both the number of arms and the learning horizon.
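To make the graph structure concrete, below is a minimal sketch, not the authors' construction: if "closeness in mean rewards" is taken to mean |mu_i - mu_j| <= eps for some threshold eps, then the resulting graph is an indifference graph, which is exactly a unit interval graph. The threshold eps and the mean rewards used here are illustrative assumptions, not values from the paper.

```python
def side_information_graph(means, eps):
    """Return the edge set of the side-information graph: an edge joins
    arms i and j whenever their mean rewards differ by at most eps.
    (Assumed closeness criterion; the abstract does not specify one.)"""
    n = len(means)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if abs(means[i] - means[j]) <= eps}

# Example: 5 arms with hypothetical mean rewards.
means = [0.10, 0.15, 0.40, 0.45, 0.90]
print(side_information_graph(means, eps=0.1))
# {(0, 1), (2, 3)} -- only arms with close means are connected.
```

Under this reading, each arm i corresponds to the unit-length interval [mu_i / eps, mu_i / eps + 1] on the real line, and two arms are adjacent exactly when their intervals overlap, which is why the induced graph is a unit interval graph.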