
      Online learning in structured Markov decision processes

      Author
      Akbarzadeh, Nima
      Advisor
      Tekin, Cem
      Date
      2017-07
      Publisher
      Bilkent University
      Language
      English
      Type
      Thesis
      Abstract
      This thesis proposes three new multi-armed bandit problems, in which the learner proceeds in a sequence of rounds where each round is a Markov Decision Process (MDP). The learner's goal is to maximize its cumulative reward without any a priori knowledge of the state transition probabilities. The first problem considers an MDP with sorted states, a continuation action that moves the learner to an adjacent state, and a terminal action that moves the learner to a terminal state (a goal or dead-end state). In this problem, a round ends and the next round starts when a terminal state is reached, and the aim of the learner in each round is to reach the goal state. First, the structure of the optimal policy is derived. Then, the regret of the learner with respect to an oracle, which takes optimal actions in each round, is defined, and a learning algorithm that exploits the structure of the optimal policy is proposed. Finally, it is shown that the regret either increases logarithmically over rounds or becomes bounded. In the second problem, we investigate the personalization of a clinical treatment. This process is modeled as a goal-oriented MDP with dead-end states. Moreover, the state transition probabilities of the MDP depend on the context of the patients. An algorithm that uses the principle of optimism in the face of uncertainty is proposed to maximize the number of rounds in which the goal state is reached. In the third problem, we propose an online learning algorithm for optimal execution in the limit order book of a financial asset. Given a certain amount of shares to sell and an allocated time to complete the transaction, the proposed algorithm dynamically learns the optimal number of shares to sell at each time slot of the allocated time. We model this problem as an MDP and derive the form of the optimal policy.
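
      The setup of the first problem can be illustrated with a minimal Python sketch: an MDP whose sorted states are visited via a continuation action, with a terminal action that reaches the goal state with an unknown probability, and a simple UCB-style optimistic index used to decide where to terminate. The state count, success probabilities, and index form below are assumptions made for illustration only; this is not the algorithm analyzed in the thesis.

      import numpy as np

      # Illustrative sketch (assumed parameters, not from the thesis).
      rng = np.random.default_rng(0)
      N = 5                                   # number of sorted, non-terminal states
      p_goal = rng.uniform(0.1, 0.9, size=N)  # unknown prob. that terminating at a state reaches the goal
      counts = np.zeros(N)                    # times the terminal action was taken in each state
      successes = np.zeros(N)                 # times the goal state was reached from each state

      def ucb(state, t):
          # Optimism in the face of uncertainty: empirical success rate plus a confidence bonus.
          if counts[state] == 0:
              return 1.0
          return successes[state] / counts[state] + np.sqrt(2 * np.log(t + 1) / counts[state])

      goal_hits = 0
      for t in range(1, 2001):                # rounds
          state = 0
          while True:
              # Terminate if the optimistic value of terminating here beats the adjacent state,
              # or if there is no adjacent state left to continue to.
              if state == N - 1 or ucb(state, t) >= ucb(state + 1, t):
                  counts[state] += 1
                  hit = rng.random() < p_goal[state]
                  successes[state] += hit
                  goal_hits += hit
                  break                       # terminal state reached: the round ends
              state += 1                      # continuation action: move to the adjacent state

      print(f"goal reached in {goal_hits} of 2000 rounds; best single-state rate = {p_goal.max():.2f}")

      Under these assumptions the learner's empirical goal-reaching rate approaches that of the best termination state, which mirrors (in a much simplified form) the oracle comparison used to define regret in the abstract.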
      Keywords
      Online Learning
      Markov Decision Process
      Multi-armed Bandits
      Reinforcement Learning
      Dynamic Programming
      Clinical Decision Making
      Limit Order Book
      Embargo Lift Date
      2019-07-21
      Permalink
      http://hdl.handle.net/11693/33535
      Collections
      • Dept. of Electrical and Electronics Engineering - Master's degree
