dc.contributor.author: Akbarzadeh, Nima [en_US]
dc.contributor.author: Tekin, Cem [en_US]
dc.coverage.spatial: Monticello, IL, USA [en_US]
dc.date.accessioned: 2018-04-12T11:46:25Z
dc.date.available: 2018-04-12T11:46:25Z
dc.date.issued: 2017 [en_US]
dc.identifier.uri: http://hdl.handle.net/11693/37637
dc.description: Date of Conference: 27-30 September 2016 [en_US]
dc.description: Conference Name: 54th Annual Allerton Conference on Communication, Control, and Computing, Allerton 2016 [en_US]
dc.description.abstract: In this paper, we propose a new multi-armed bandit problem called the Gambler's Ruin Bandit Problem (GRBP). In the GRBP, the learner proceeds in a sequence of rounds, where each round is a Markov Decision Process (MDP) with two actions (arms): a continuation action that moves the learner randomly over the state space around the current state; and a terminal action that moves the learner directly into one of the two terminal states (goal and dead-end state). The current round ends when a terminal state is reached, and the learner incurs a positive reward only when the goal state is reached. The objective of the learner is to maximize its long-term reward (expected number of times the goal state is reached), without having any prior knowledge on the state transition probabilities. We first prove a result on the form of the optimal policy for the GRBP. Then, we define the regret of the learner with respect to an omnipotent oracle, which acts optimally in each round, and prove that it increases logarithmically over rounds. We also identify a condition under which the learner's regret is bounded. A potential application of the GRBP is optimal medical treatment assignment, in which the continuation action corresponds to a conservative treatment and the terminal action corresponds to a risky treatment such as surgery. [en_US]
dc.language.iso: English [en_US]
dc.source.title: Proceedings of the 54th Annual Allerton Conference on Communication, Control, and Computing, Allerton 2016 [en_US]
dc.relation.isversionof: http://dx.doi.org/10.1109/ALLERTON.2016.7852376 [en_US]
dc.subject: Learning algorithms [en_US]
dc.subject: Bandit problems [en_US]
dc.subject: Conservative treatments [en_US]
dc.subject: Markov decision processes [en_US]
dc.subject: Medical treatment [en_US]
dc.subject: Multi-armed bandit problem [en_US]
dc.subject: Optimal policies [en_US]
dc.subject: Prior knowledge [en_US]
dc.subject: State transition probabilities [en_US]
dc.title: Gambler's ruin bandit problem [en_US]
dc.type: Conference Paper [en_US]
dc.department: Department of Electrical and Electronics Engineering [en_US]
dc.citation.spage: 1236 [en_US]
dc.citation.epage: 1243 [en_US]
dc.identifier.doi: 10.1109/ALLERTON.2016.7852376 [en_US]
dc.publisher: IEEE [en_US]
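
The round structure described in the abstract (a continuation action that moves the learner randomly, and a terminal action that jumps directly to the goal or dead-end state) can be illustrated with a small simulation. This is only a sketch under assumptions that are not stated in this record: states are the integers 0..goal, state 0 is the dead-end, the continuation action moves one step up with probability p_continue and one step down otherwise, and the terminal action reaches the goal with probability p_terminal. The function names, parameters, and the stop_states policy representation are hypothetical and are not the paper's algorithm.

```python
import random


def play_round(start, goal, p_continue, p_terminal, stop_states, rng):
    """Simulate one round; return 1 if the goal state is reached, else 0.

    Assumed model (illustrative, not from the record): states are 0..goal,
    0 is the dead-end state, `goal` is the goal state, and `stop_states`
    is the set of states in which the terminal action is taken.
    """
    state = start
    while 0 < state < goal:
        if state in stop_states:
            # Terminal action: jump straight to the goal or the dead end.
            return 1 if rng.random() < p_terminal else 0
        # Continuation action: random one-step move around the current state.
        state += 1 if rng.random() < p_continue else -1
    return 1 if state == goal else 0


def goal_probability(goal, p_continue, p_terminal, stop_states, sweeps=10_000):
    """Probability of reaching the goal from each state under a fixed policy.

    Uses repeated in-place sweeps of the one-step recursion, which converge
    because every round ends in a terminal state.
    """
    value = [0.0] * (goal + 1)
    value[goal] = 1.0
    for _ in range(sweeps):
        for s in range(1, goal):
            if s in stop_states:
                value[s] = p_terminal
            else:
                value[s] = (p_continue * value[s + 1]
                            + (1 - p_continue) * value[s - 1])
    return value


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical numbers: take the terminal action only in state 1.
    policy = {1}
    rounds = 10_000
    wins = sum(play_round(2, 4, 0.5, 0.3, policy, rng) for _ in range(rounds))
    print("empirical goal rate:", wins / rounds)
    print("computed goal probability:", goal_probability(4, 0.5, 0.3, policy)[2])
```

In this sketch the transition probabilities are passed in directly, whereas the learner in the GRBP does not know them and must estimate them over rounds. The code only evaluates a fixed policy and does not implement a learning rule or a regret computation.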

