Multi-objective contextual bandits with a dominant objective
Author
Tekin, Cem
Turgay, Eralp
Date
2017
Source Title
Proceedings of the IEEE 27th International Workshop on Machine Learning for Signal Processing, MLSP 2017
Print ISSN
2161-0363
Publisher
IEEE
Language
English
Type
Conference Paper
Abstract
In this paper, we propose a new contextual bandit problem with two objectives, where one objective dominates the other. Unlike single-objective bandit problems, in which the learner obtains a random scalar reward for each arm it selects, in the proposed problem the learner obtains a random reward vector, where each component corresponds to one of the objectives. The goal of the learner is to maximize its total reward in the non-dominant objective while ensuring that it maximizes its reward in the dominant objective. In this case, the optimal arm for a given context is the one that maximizes the expected reward in the non-dominant objective among all arms that maximize the expected reward in the dominant objective. For this problem, we propose the multi-objective contextual multi-armed bandit algorithm (MOC-MAB) and prove that it achieves sublinear regret with respect to the optimal context-dependent policy. We then compare the performance of the proposed algorithm with other state-of-the-art bandit algorithms. The proposed contextual bandit model and algorithm have a wide range of real-world applications involving multiple, possibly conflicting objectives, ranging from wireless communication to medical diagnosis and recommender systems.
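For illustration, here is a minimal sketch of the lexicographic arm-selection rule described in the abstract: among the arms that maximize the expected reward in the dominant objective, pick the one with the highest expected reward in the non-dominant objective. The function name, tolerance parameter, and reward arrays below are hypothetical and are not taken from the paper.

```python
import numpy as np

def lexicographic_optimal_arm(mu_dominant, mu_nondominant, tol=1e-9):
    """Return the index of the arm that maximizes the non-dominant
    expected reward among all arms that maximize the dominant one.

    mu_dominant, mu_nondominant: per-arm expected rewards for a given
    context (illustrative inputs, not the paper's notation).
    """
    mu_dominant = np.asarray(mu_dominant, dtype=float)
    mu_nondominant = np.asarray(mu_nondominant, dtype=float)
    # Arms that (near-)maximize the dominant objective
    candidates = np.flatnonzero(mu_dominant >= mu_dominant.max() - tol)
    # Among those, pick the best arm in the non-dominant objective
    return int(candidates[np.argmax(mu_nondominant[candidates])])

# Example: arms 0 and 2 tie on the dominant objective,
# but arm 2 is better on the non-dominant one.
print(lexicographic_optimal_arm([0.9, 0.5, 0.9], [0.2, 0.8, 0.6]))  # -> 2
```

Note that this sketch only captures the optimality criterion that defines the benchmark policy; in the learning problem itself the expected rewards are unknown and must be estimated online from observed reward vectors, which is what MOC-MAB addresses.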
Keywords
Contextual bandits
Dominant objective
Multi-objective bandits
Online learning
Regret bounds
Artificial intelligence
Diagnosis
Learning systems
Wireless telecommunication systems
Signal processing