Global bandits

Atan, O.; Tekin, Cem; Schaar, M. V. D.

buir.contributor.author	Tekin, Cem
dc.citation.epage	5811	en_US
dc.citation.issueNumber	12	en_US
dc.citation.spage	5798	en_US
dc.citation.volumeNumber	29	en_US
dc.contributor.author	Atan, O.	en_US
dc.contributor.author	Tekin, Cem	en_US
dc.contributor.author	Schaar, M. V. D.	en_US
dc.date.accessioned	2019-02-21T16:05:50Z
dc.date.available	2019-02-21T16:05:50Z
dc.date.issued	2018	en_US
dc.department	Department of Electrical and Electronics Engineering	en_US
dc.description.abstract	Multiarmed bandits (MABs) model sequential decision-making problems in which a learner sequentially chooses arms with unknown reward distributions in order to maximize its cumulative reward. Most prior work on MABs assumes that the reward distributions of the arms are independent. However, in a wide variety of decision problems, from drug dosage to dynamic pricing, the expected rewards of different arms are correlated, so that selecting one arm also provides information about the expected rewards of the other arms. We propose and analyze a class of models of such decision problems, which we call global bandits (GB). In the case in which rewards of all arms are deterministic functions of a single unknown parameter, we construct a greedy policy that achieves bounded regret, with a bound that depends on the single true parameter of the problem. Hence, this policy selects suboptimal arms only finitely many times with probability one. For this case, we also obtain a bound on regret that is independent of the true parameter; this bound is sublinear, with an exponent that depends on the informativeness of the arms. We also propose a variant of the greedy policy that achieves O(√T) worst-case and O(1) parameter-dependent regret. Finally, we perform experiments on dynamic pricing and show that the proposed algorithms achieve significant gains over well-known benchmarks.
dc.description.sponsorship	Manuscript received April 13, 2017; revised December 21, 2017; accepted March 1, 2018. Date of publication April 12, 2018; date of current version November 16, 2018. The work of O. Atan and M. van der Schaar was supported by the NSF under Grant 1533983, Grant 1407712, and Grant 1462245. This paper was presented at the 2015 International Conference on Artificial Intelligence and Statistics (AISTATS), San Diego, CA, USA, May 2015. (Corresponding author: Onur Atan.) O. Atan and M. van der Schaar are with the Department of Electrical Engineering, University of California at Los Angeles, Los Angeles, CA 90095 USA (e-mail: oatan@ucla.edu; mihaela@ee.ucla.edu).
dc.identifier.doi	10.1109/TNNLS.2018.2818742
dc.identifier.issn	2162-237X
dc.identifier.uri	http://hdl.handle.net/11693/50276
dc.language.iso	English
dc.publisher	Institute of Electrical and Electronics Engineers
dc.relation.isversionof	https://doi.org/10.1109/TNNLS.2018.2818742
dc.relation.project	University of California, UC - National Science Foundation, NSF: 1462245 - National Science Foundation, NSF: 1533983 - National Science Foundation, NSF: 1407712
dc.source.title	IEEE Transactions on Neural Networks and Learning Systems	en_US
dc.subject	Bounded regret	en_US
dc.subject	Informative arms	en_US
dc.subject	Multiarmed bandits (MABs)	en_US
dc.subject	Online learning	en_US
dc.subject	Regret analysis	en_US
dc.title	Global bandits	en_US
dc.type	Article	en_US
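
The greedy idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two reward functions, their inverses, the Gaussian noise model, the count-weighted averaging of per-arm parameter estimates, and all names (ARMS, greedy_global_bandit, clip01) are assumptions made here for illustration only. The sketch assumes that each arm's expected reward is a known, invertible function of one unknown parameter theta in [0, 1], so that a pull of any arm yields an estimate of theta that informs the choice among all arms.

```python
import random

# Hypothetical arms: (expected reward as a function of theta, its inverse).
# These particular functions are illustrative assumptions, not from the paper.
ARMS = [
    (lambda th: 1.0 - th, lambda y: 1.0 - y),    # better when theta is small
    (lambda th: th ** 2, lambda y: y ** 0.5),    # better when theta is large
]


def clip01(x):
    """Clip a value to [0, 1] so the inverse functions stay well defined."""
    return min(1.0, max(0.0, x))


def greedy_global_bandit(arms, theta_true, horizon, noise=0.05, seed=0):
    """Run a greedy policy that shares information across arms via theta.

    Returns the list of chosen arm indices and the per-arm pull counts.
    """
    rng = random.Random(seed)
    counts = [0] * len(arms)
    sums = [0.0] * len(arms)
    choices = []
    for t in range(horizon):
        if t == 0:
            k = 0  # arbitrary first pull; no estimate exists yet
        else:
            # Invert each pulled arm's sample mean to estimate theta, then
            # average the estimates weighted by pull counts (they sum to t).
            theta_hat = sum(
                counts[i] * arms[i][1](clip01(sums[i] / counts[i]))
                for i in range(len(arms))
                if counts[i] > 0
            ) / t
            # Greedy step: pick the arm that looks best under the estimate.
            k = max(range(len(arms)), key=lambda i: arms[i][0](theta_hat))
        reward = arms[k][0](theta_true) + rng.gauss(0.0, noise)
        counts[k] += 1
        sums[k] += reward
        choices.append(k)
    return choices, counts


if __name__ == "__main__":
    _, pulls = greedy_global_bandit(ARMS, theta_true=0.8, horizon=2000)
    print("pull counts per arm:", pulls)
```

Because every pull refines the same estimate of theta, the sketch needs no explicit exploration bonus, which is the intuition behind the bounded-regret claim in the abstract; the sketch itself proves nothing about regret.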

Files

Original bundle

Name: Global_bandits.pdf
Size: 1.76 MB
Format: Adobe Portable Document Format
Description: Full printable version

Collections

Scholarly Publications - Electrical and Electronics Engineering