Index policy for multiarmed bandit problem with dynamic risk measures

Limited Access
This item is unavailable until:
2025-08-06

Date

2023-08-06

Source Title

European Journal of Operational Research

Print ISSN

0377-2217

Electronic ISSN

1872-6860

Publisher

Elsevier BV

Volume

312

Issue

2

Pages

627 - 640

Language

en

Abstract

The multiarmed bandit problem (MAB) is a classic problem in which a finite amount of resources must be allocated among competing choices with the aim of identifying a policy that maximizes the expected total reward. MAB has a wide range of applications, including clinical trials, portfolio design, parameter tuning, internet advertising, auction mechanisms, adaptive routing in networks, and project management. The classical MAB makes the strong assumption that the decision maker is risk-neutral and indifferent to the variability of the outcome. In many real-life applications, however, these assumptions do not hold and decision makers are risk-averse. Motivated by this, we study risk-averse control of the multiarmed bandit problem using dynamic coherent risk measures to determine a policy with the best risk-adjusted total discounted return. For this setting, we present a theoretical analysis based on Whittle's retirement problem and propose a priority-index policy that reduces to the Gittins index as the level of risk aversion converges to zero. We generalize the restart formulation of the Gittins index to compute these risk-averse allocation indices efficiently. Numerical results exhibit the excellent performance of this heuristic approach for two well-known coherent risk measures: first-order mean-semideviation and mean-AVaR. Our experimental studies suggest that an index-based optimal policy is not guaranteed to exist for the risk-averse problem. Nonetheless, our risk-averse allocation indices can achieve optimal or near-optimal policies that, in some instances, are easier to interpret than the exact optimal policy.
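The restart formulation mentioned in the abstract can be illustrated in the classical risk-neutral case (Katehakis and Veinott's restart-in-state characterization): the Gittins index of state i equals (1 - beta) times the value, at i, of an auxiliary MDP in which every step offers either "continue from the current state" or "restart in i". The sketch below is a minimal risk-neutral illustration, not the paper's risk-averse generalization; the reward vector, transition matrix, and discount factor are made-up examples.

```python
import numpy as np

def gittins_indices(r, P, beta, n_iter=10000, tol=1e-10):
    """Gittins indices of a Markov chain bandit via the restart
    formulation: for each state i, solve by value iteration the MDP
        V_i(x) = max( r(x) + beta * P[x] @ V_i,      # continue
                      r(i) + beta * P[i] @ V_i )     # restart in i
    and return nu(i) = (1 - beta) * V_i(i)."""
    n = len(r)
    nu = np.zeros(n)
    for i in range(n):
        V = np.zeros(n)
        for _ in range(n_iter):
            cont = r + beta * (P @ V)          # continue from current state
            restart = r[i] + beta * (P[i] @ V)  # restart in state i
            V_new = np.maximum(cont, restart)
            if np.max(np.abs(V_new - V)) < tol:
                V = V_new
                break
            V = V_new
        nu[i] = (1 - beta) * V[i]
    return nu

# Hypothetical 2-state example: state 0 is absorbing with reward 1,
# so its Gittins index is exactly 1; state 1 earns 0.2 and reaches
# state 0 with probability 0.3, so its index lies strictly between.
r = np.array([1.0, 0.2])
P = np.array([[1.0, 0.0],
              [0.3, 0.7]])
nu = gittins_indices(r, P, beta=0.9)
```

The index policy then simply plays, at each step, the arm whose current state has the largest index; the paper's risk-averse indices replace the expected-value recursion above with one driven by a dynamic coherent risk measure.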
