{"title":"具有主导目标的多目标情境强盗","authors":"Cem Tekin, E. Turğay","doi":"10.1109/MLSP.2017.8168123","DOIUrl":null,"url":null,"abstract":"In this paper, we propose a new contextual bandit problem with two objectives, where one of the objectives dominates the other objective. Unlike single-objective bandit problems in which the learner obtains a random scalar reward for each arm it selects, in the proposed problem, the learner obtains a random reward vector, where each component of the reward vector corresponds to one of the objectives. The goal of the learner is to maximize its total reward in the non-dominant objective while ensuring that it maximizes its reward in the dominant objective. In this case, the optimal arm given a context is the one that maximizes the expected reward in the non-dominant objective among all arms that maximize the expected reward in the dominant objective. For this problem, we propose the multi-objective contextual multi-armed bandit algorithm (MOC-MAB), and prove that it achieves sublinear regret with respect to the optimal context dependent policy. Then, we compare the performance of the proposed algorithm with other state-of-the-art bandit algorithms. The proposed contextual bandit model and the algorithm have a wide range of real-world applications that involve multiple and possibly conflicting objectives ranging from wireless communication to medical diagnosis and recommender systems.","PeriodicalId":6542,"journal":{"name":"2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP)","volume":"110 1","pages":"1-6"},"PeriodicalIF":0.0000,"publicationDate":"2017-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":"{\"title\":\"Multi-Objective contextual bandits with a dominant objective\",\"authors\":\"Cem Tekin, E. Turğay\",\"doi\":\"10.1109/MLSP.2017.8168123\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper, we propose a new contextual bandit problem with two objectives, where one of the objectives dominates the other objective. Unlike single-objective bandit problems in which the learner obtains a random scalar reward for each arm it selects, in the proposed problem, the learner obtains a random reward vector, where each component of the reward vector corresponds to one of the objectives. The goal of the learner is to maximize its total reward in the non-dominant objective while ensuring that it maximizes its reward in the dominant objective. In this case, the optimal arm given a context is the one that maximizes the expected reward in the non-dominant objective among all arms that maximize the expected reward in the dominant objective. For this problem, we propose the multi-objective contextual multi-armed bandit algorithm (MOC-MAB), and prove that it achieves sublinear regret with respect to the optimal context dependent policy. Then, we compare the performance of the proposed algorithm with other state-of-the-art bandit algorithms. 
The proposed contextual bandit model and the algorithm have a wide range of real-world applications that involve multiple and possibly conflicting objectives ranging from wireless communication to medical diagnosis and recommender systems.\",\"PeriodicalId\":6542,\"journal\":{\"name\":\"2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP)\",\"volume\":\"110 1\",\"pages\":\"1-6\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"9\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/MLSP.2017.8168123\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MLSP.2017.8168123","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract: In this paper, we propose a new contextual bandit problem with two objectives, where one objective dominates the other. Unlike single-objective bandit problems, in which the learner obtains a random scalar reward for each arm it selects, in the proposed problem the learner obtains a random reward vector whose components correspond to the two objectives. The learner's goal is to maximize its total reward in the non-dominant objective while ensuring that it maximizes its reward in the dominant objective. Hence, the optimal arm for a given context is the one that maximizes the expected reward in the non-dominant objective among all arms that maximize the expected reward in the dominant objective. For this problem, we propose the multi-objective contextual multi-armed bandit algorithm (MOC-MAB) and prove that it achieves sublinear regret with respect to the optimal context-dependent policy. We then compare the performance of the proposed algorithm with that of other state-of-the-art bandit algorithms. The proposed contextual bandit model and algorithm have a wide range of real-world applications involving multiple, possibly conflicting objectives, ranging from wireless communication to medical diagnosis and recommender systems.
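For illustration only, the following is a minimal Python sketch (not from the paper) of the lexicographic arm-selection rule the abstract describes: among all arms that maximize the expected reward in the dominant objective, pick the one with the largest expected reward in the non-dominant objective. The function name, the tie tolerance, and the NumPy-based interface are assumptions made for this sketch; MOC-MAB itself additionally builds confidence-bound estimates from contexts, which is not shown here.

import numpy as np

def lexicographic_optimal_arm(mu_dominant, mu_nondominant, tol=1e-9):
    # mu_dominant[k] and mu_nondominant[k] are the (estimated) expected rewards
    # of arm k in the dominant and non-dominant objectives for the current context.
    mu_dominant = np.asarray(mu_dominant, dtype=float)
    mu_nondominant = np.asarray(mu_nondominant, dtype=float)

    # Arms that (approximately, up to tol) maximize the dominant objective.
    candidates = np.flatnonzero(mu_dominant >= mu_dominant.max() - tol)

    # Among those, return the arm with the largest non-dominant expected reward.
    return candidates[np.argmax(mu_nondominant[candidates])]

# Example: arms 1 and 2 tie in the dominant objective; arm 2 wins on the non-dominant one.
print(lexicographic_optimal_arm([0.4, 0.9, 0.9], [0.8, 0.2, 0.6]))  # -> 2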