{"title":"Mean-variance and value at risk in multi-armed bandit problems","authors":"Sattar Vakili, Qing Zhao","doi":"10.1109/ALLERTON.2015.7447162","DOIUrl":null,"url":null,"abstract":"We study risk-averse multi-armed bandit problems under different risk measures. We consider three risk mitigation models. In the first model, the variations in the reward values obtained at different times are considered as risk and the objective is to minimize the mean-variance of the observed rewards. In the second and the third models, the quantity of interest is the total reward at the end of the time horizon, and the objective is to minimize the mean-variance and maximize the value at risk of the total reward, respectively. We develop risk-averse online learning policies and analyze their regret performance. We also provide tight lower bounds on regret under the model of mean-variance of observations.","PeriodicalId":112948,"journal":{"name":"2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"34","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ALLERTON.2015.7447162","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 34
Abstract
We study risk-averse multi-armed bandit problems under different risk measures, considering three risk models. In the first model, the variation in the reward values observed over time is treated as the risk, and the objective is to minimize the mean-variance of the observed rewards. In the second and third models, the quantity of interest is the total reward at the end of the time horizon, and the objective is to minimize its mean-variance and to maximize its value at risk, respectively. We develop risk-averse online learning policies and analyze their regret performance. We also provide tight lower bounds on regret under the model of mean-variance of observations.
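For reference, a minimal sketch of the standard definitions these objectives build on; the notation ($\rho$, $\alpha$, $\bar{x}$, $\bar{\sigma}^2$) follows common convention in the risk-averse bandit literature and is an assumption here, not necessarily the paper's exact formulation. The empirical mean-variance of the rewards seen up to time $t$, with risk-tolerance parameter $\rho > 0$, is

$$\xi(t) \;=\; \bar{\sigma}^2(t) \;-\; \rho\,\bar{x}(t),$$

where $\bar{x}(t)$ and $\bar{\sigma}^2(t)$ denote the empirical mean and empirical variance of the observed rewards; minimizing $\xi$ trades reward variability against expected reward. For a random total reward $R$, the value at risk at confidence level $\alpha \in (0,1)$ is the largest threshold met with probability at least $\alpha$:

$$\mathrm{VaR}_\alpha(R) \;=\; \sup\{\, v \in \mathbb{R} \;:\; \Pr(R \ge v) \ge \alpha \,\}.$$

Under these definitions, maximizing the value at risk of the total reward favors arms whose cumulative reward is large with high probability, rather than merely large in expectation.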