Index policy for multiarmed bandit problem with dynamic risk measures

Malekipirbazari, Milad; Çavus, Özlem

buir.contributor.author	Malekipirbazari, Milad
buir.contributor.author	Çavus, Özlem
buir.contributor.orcid	Malekipirbazari, Milad|0000-0002-3212-6498
dc.citation.epage	640	en_US
dc.citation.issueNumber	2
dc.citation.spage	627
dc.citation.volumeNumber	312
dc.contributor.author	Malekipirbazari, Milad
dc.contributor.author	Çavus, Özlem
dc.date.accessioned	2024-03-12T11:21:42Z
dc.date.available	2024-03-12T11:21:42Z
dc.date.issued	2023-08-06
dc.department	Department of Industrial Engineering
dc.description.abstract	The multiarmed bandit (MAB) problem is a classic problem in which a finite amount of resources must be allocated among competing choices with the aim of identifying a policy that maximizes the expected total reward. MAB has a wide range of applications, including clinical trials, portfolio design, parameter tuning, internet advertising, auction mechanisms, adaptive routing in networks, and project management. The classical MAB makes the strong assumption that the decision maker is risk-neutral and indifferent to the variability of the outcome. However, in many real-life applications this assumption is not met and decision makers are risk-averse. Motivated by this, we study risk-averse control of the multiarmed bandit problem using dynamic coherent risk measures to determine a policy with the best risk-adjusted total discounted return. In this setting, we present a theoretical analysis based on Whittle’s retirement problem and propose a priority-index policy that reduces to the Gittins index when the level of risk-aversion converges to zero. We generalize the restart formulation of the Gittins index to effectively compute these risk-averse allocation indices. Numerical results show the excellent performance of this heuristic approach for two well-known coherent risk measures, first-order mean-semideviation and mean-AVaR. Our experimental studies suggest that there is no guarantee that an index-based optimal policy exists for the risk-averse problem. Nonetheless, our risk-averse allocation indices can achieve optimal or near-optimal policies, which in some instances are easier to interpret than the exact optimal policy.
dc.description.tableofcontents	Stochastics and statistics
dc.embargo.release	2025-08-06
dc.identifier.doi	10.1016/j.ejor.2023.08.004
dc.identifier.eissn	1872-6860
dc.identifier.issn	0377-2217
dc.identifier.uri	https://hdl.handle.net/11693/114590
dc.language.iso	en
dc.publisher	Elsevier BV
dc.relation.isversionof	https://doi.org/10.1016/j.ejor.2023.08.004
dc.rights	CC BY-NC-ND 4.0 DEED (Attribution-NonCommercial-NoDerivs 4.0 International)
dc.rights.uri	https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.source.title	European Journal of Operational Research
dc.subject	Stochastic programming
dc.subject	Multiarmed bandit problem
dc.subject	Gittins index
dc.subject	Dynamic coherent risk measures
dc.subject	Risk-averse control
dc.title	Index policy for multiarmed bandit problem with dynamic risk measures
dc.type	Article

Files

Original bundle

Name: Index_policy_for_multiarmed_bandit_problem_with_dynamic_risk_measures.pdf
Size: 1.42 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 2.01 KB
Format: Item-specific license agreed upon to submission

Collections

Scholarly Publications - Industrial Engineering