Online learning in structured Markov decision processes

buir.advisor: Tekin, Cem
dc.contributor.author: Akbarzadeh, Nima
dc.date.accessioned: 2017-08-07T09:41:12Z
dc.date.available: 2017-08-07T09:41:12Z
dc.date.copyright: 2017-07
dc.date.issued: 2017-07
dc.date.submitted: 2017-08-03
dc.description: Cataloged from PDF version of article.
dc.description: Includes bibliographical references (leaves 80-86).
dc.description.abstract: This thesis proposes three new multi-armed bandit problems in which the learner proceeds in a sequence of rounds, where each round is a Markov decision process (MDP). The learner's goal is to maximize its cumulative reward without any a priori knowledge of the state transition probabilities. The first problem considers an MDP with sorted states and two actions: a continuation action that moves the learner to an adjacent state, and a terminal action that moves the learner to a terminal state (a goal or dead-end state). In this problem, a round ends and the next round starts when a terminal state is reached, and the aim of the learner in each round is to reach the goal state. First, the structure of the optimal policy is derived. Then, the regret of the learner with respect to an oracle that takes the optimal action in each round is defined, and a learning algorithm that exploits the structure of the optimal policy is proposed. Finally, it is shown that the regret either increases logarithmically over rounds or becomes bounded. In the second problem, we investigate the personalization of clinical treatment. This process is modeled as a goal-oriented MDP with dead-end states, where the state transition probabilities of the MDP depend on the context of the patient. An algorithm that uses the principle of optimism in the face of uncertainty is proposed to maximize the number of rounds in which the goal state is reached. In the third problem, we propose an online learning algorithm for optimal execution in the limit order book of a financial asset. Given a certain number of shares to sell and an allocated time to complete the transaction, the proposed algorithm dynamically learns the optimal number of shares to sell in each time slot of the allocated time. We model this problem as an MDP and derive the form of the optimal policy.
dc.description.statementofresponsibility: by Nima Akbarzadeh.
dc.embargo.release: 2019-07-21
dc.format.extent: xi, 86 leaves : charts (some color) ; 29 cm.
dc.identifier.itemid: B156076
dc.identifier.uri: http://hdl.handle.net/11693/33535
dc.language.iso: English
dc.rights: info:eu-repo/semantics/openAccess
dc.subject: Online Learning
dc.subject: Markov Decision Process
dc.subject: Multi-armed Bandits
dc.subject: Reinforcement Learning
dc.subject: Dynamic Programming
dc.subject: Clinical Decision Making
dc.subject: Limit Order Book
dc.title: Online learning in structured Markov decision processes
dc.title.alternative: Özel yapılı Markov karar süreçlerinde çevrimiçi öğrenme
dc.type: Thesis
thesis.degree.discipline: Electrical and Electronic Engineering
thesis.degree.grantor: Bilkent University
thesis.degree.level: Master's
thesis.degree.name: MS (Master of Science)
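To make the setting of the abstract's first problem concrete, the following is a minimal sketch (not the thesis's algorithm, and with entirely hypothetical transition probabilities) of a goal-oriented MDP with sorted states, a "continue" action that moves to an adjacent state, and a "terminate" action that leads to the goal or a dead-end state; value iteration with the model known recovers an optimal policy.

```python
import numpy as np

# Hypothetical goal-oriented MDP: interior states 0..N-1 are sorted,
# GOAL pays reward 1, DEAD pays 0. All numbers are illustrative only.
N = 5
GOAL, DEAD = N, N + 1
ACTIONS = ("continue", "terminate")

def transitions(s, a):
    """Return a list of (next_state, probability) pairs (hypothetical numbers)."""
    if a == "continue":
        if s + 1 < N:
            # Mostly move up toward the highest interior state.
            return [(s + 1, 0.8), (max(s - 1, 0), 0.2)]
        return [(s, 1.0)]  # top interior state: stay put
    # Terminal action: higher states reach the goal more often.
    p_goal = (s + 1) / (N + 1)
    return [(GOAL, p_goal), (DEAD, 1.0 - p_goal)]

def value_iteration(tol=1e-9):
    """Undiscounted value iteration; terminal-state values are fixed."""
    V = np.zeros(N + 2)
    V[GOAL] = 1.0
    while True:
        V_new = V.copy()
        for s in range(N):
            V_new[s] = max(sum(p * V[t] for t, p in transitions(s, a))
                           for a in ACTIONS)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def q_value(V, s, a):
    return sum(p * V[t] for t, p in transitions(s, a))

V = value_iteration()
policy = ["terminate" if q_value(V, s, "terminate") >= q_value(V, s, "continue")
          else "continue" for s in range(N)]
```

With these made-up probabilities, the greedy policy with respect to `V` continues upward and terminates only at the highest interior state, i.e. it has the threshold form whose structure the thesis exploits when the transition probabilities must be learned online.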

Files

Original bundle

Name: thesis.pdf
Size: 1.31 MB
Format: Adobe Portable Document Format
Description: Full printable version

License bundle

Name: license.txt
Size: 1.71 KB
Description: Item-specific license agreed upon to submission