Online learning in structured Markov decision processes
Author
Akbarzadeh, Nima
Advisor
Tekin, Cem
Date
2017-07
Publisher
Bilkent University
Language
English
Type
Thesis
Abstract
This thesis proposes three new multi-armed bandit problems, in which the learner
proceeds in a sequence of rounds where each round is a Markov Decision Process
(MDP). The learner's goal is to maximize its cumulative reward without any a
priori knowledge of the state transition probabilities. The first problem considers
an MDP with sorted states, a continuation action that moves the learner to an
adjacent state, and a terminal action that moves the learner to a terminal state
(goal or dead-end state). In this problem, a round ends and the next round starts
when a terminal state is reached, and the aim of the learner in each round is to
reach the goal state. First, the structure of the optimal policy is derived. Then,
the regret of the learner with respect to an oracle, who takes optimal actions in
each round, is defined, and a learning algorithm that exploits the structure of the
optimal policy is proposed. Finally, it is shown that the regret either increases
logarithmically over rounds or becomes bounded. In the second problem, we
investigate the personalization of a clinical treatment. This process is modeled
as a goal-oriented MDP with dead-end states. Moreover, the state transition
probabilities of the MDP depend on the context of the patients. An algorithm
that uses the rule of optimism in the face of uncertainty is proposed to maximize the
number of rounds in which the goal state is reached. In the third problem, we
propose an online learning algorithm for optimal execution in the limit order book
of a financial asset. Given a certain amount of shares to sell and an allocated
time to complete the transaction, the proposed algorithm dynamically learns the
optimal number of shares to sell at each time slot of the allocated time. We model
this problem as an MDP, and derive the form of the optimal policy.
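The rule of optimism in the face of uncertainty mentioned in the abstract can be illustrated with a generic UCB1-style index. This is a minimal sketch and not the algorithm proposed in the thesis: it assumes a plain bandit setting in which each arm stands for a candidate policy and a round counts as a success when the goal state is reached. The names `ucb_index` and `select_arm` are hypothetical.

```python
import math


def ucb_index(successes: int, pulls: int, round_number: int) -> float:
    """Optimistic estimate of an arm's success probability (UCB1-style).

    Assumption: rewards are 0/1 (goal state reached or not in a round).
    """
    if pulls == 0:
        # An arm that was never tried gets an infinite index, so it is tried first.
        return float("inf")
    empirical_mean = successes / pulls
    # The bonus shrinks as the arm is sampled more often.
    exploration_bonus = math.sqrt(2.0 * math.log(round_number) / pulls)
    return empirical_mean + exploration_bonus


def select_arm(successes, pulls, round_number):
    """Return the position of the arm with the largest optimistic index."""
    indices = [ucb_index(s, n, round_number) for s, n in zip(successes, pulls)]
    return max(range(len(indices)), key=indices.__getitem__)
```

In this sketch an arm is chosen either because its empirical success rate is high or because it has been sampled too rarely for that rate to be trusted, which is the trade-off the optimism rule is meant to capture.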
Keywords
Online Learning
Markov Decision Process
Multi-armed Bandits
Reinforcement Learning
Dynamic Programming
Clinical Decision Making
Limit Order Book