
      Global bandits

      Author
      Atan, O.
      Tekin, C.
van der Schaar, M.
      Date
      2018
      Source Title
      IEEE Transactions on Neural Networks and Learning Systems
      Print ISSN
      2162-237X
      Publisher
      Institute of Electrical and Electronics Engineers
      Volume
      29
      Issue
      12
      Pages
      5798 - 5811
      Language
      English
      Type
      Article
Item Usage Stats
118 views, 107 downloads
      Abstract
Multiarmed bandits (MABs) model sequential decision-making problems in which a learner sequentially chooses arms with unknown reward distributions in order to maximize its cumulative reward. Most prior work on MABs assumes that the reward distributions of the arms are independent. But in a wide variety of decision problems - from drug dosage to dynamic pricing - the expected rewards of different arms are correlated, so that selecting one arm also provides information about the expected rewards of the other arms. We propose and analyze a class of models for such decision problems, which we call global bandits (GB). For the case in which the rewards of all arms are deterministic functions of a single unknown parameter, we construct a greedy policy that achieves bounded regret, with a bound that depends on the true value of the parameter. Hence, this policy selects suboptimal arms only finitely many times with probability one. For this case, we also obtain a regret bound that is independent of the true parameter; this bound is sublinear, with an exponent that depends on the informativeness of the arms. We also propose a variant of the greedy policy that achieves O(√T) worst-case and O(1) parameter-dependent regret. Finally, we perform experiments on dynamic pricing and show that the proposed algorithms achieve significant gains over well-known benchmarks.
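The greedy policy summarized above can be illustrated with a small simulation. The following is a minimal sketch, not the paper's exact algorithm (the published policy combines per-arm estimates more carefully, using informativeness constants); it assumes each arm's expected reward is a known, invertible function of a single shared parameter theta, so every pull yields a point estimate of theta through the inverse reward function. All functions, constants, and names below are hypothetical.

import numpy as np

# Sketch of a greedy policy for global bandits (illustrative assumptions only):
# arm k has expected reward mu_k(theta) for one unknown theta, and mu_k is
# known and invertible, so each arm's sample mean gives an estimate of theta.

rng = np.random.default_rng(0)

theta_true = 0.6                       # single unknown parameter in [0, 1]
mus = [lambda t: 0.9 * t,              # arm 0: expected reward as a function of theta
       lambda t: 1.0 - 0.8 * t,        # arm 1
       lambda t: 0.5 + 0.3 * t]        # arm 2
mu_invs = [lambda y: y / 0.9,          # inverses of the reward functions
           lambda y: (1.0 - y) / 0.8,
           lambda y: (y - 0.5) / 0.3]

K, T = 3, 5000
counts = np.zeros(K)                   # number of pulls per arm
sums = np.zeros(K)                     # summed observed rewards per arm

for t in range(T):
    if t < K:
        k = t                          # pull each arm once to initialize
    else:
        # Per-arm estimates of theta via the inverse reward functions,
        # combined with weights proportional to pull counts (arms pulled
        # more often give lower-variance estimates).
        est = np.array([np.clip(mu_invs[i](sums[i] / counts[i]), 0.0, 1.0)
                        for i in range(K)])
        theta_hat = float(np.dot(counts / counts.sum(), est))
        # Greedy step: play the arm that looks best under theta_hat.
        k = int(np.argmax([mu(theta_hat) for mu in mus]))
    reward = mus[k](theta_true) + rng.normal(0.0, 0.1)
    counts[k] += 1
    sums[k] += reward

best = max(mu(theta_true) for mu in mus)
print("average per-round regret:", best - sums.sum() / T)

Because every pull is informative about the one shared parameter, the estimate theta_hat concentrates regardless of which arm is played; this is the intuition behind the bounded parameter-dependent regret claimed in the abstract.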
      Keywords
      Bounded regret
      Informative arms
      Multiarmed bandits (MABs)
      Online learning
      Regret analysis
      Permalink
      http://hdl.handle.net/11693/50276
      Published Version (Please cite this version)
      https://doi.org/10.1109/TNNLS.2018.2818742
      Collections
• Department of Electrical and Electronics Engineering