Index policy for multiarmed bandit problem with dynamic risk measures

buir.contributor.author Malekipirbazari, Milad
buir.contributor.author Çavus, Özlem
buir.contributor.orcid Malekipirbazari, Milad|0000-0002-3212-6498
dc.citation.epage 640 en_US
dc.citation.issueNumber 2
dc.citation.spage 627
dc.citation.volumeNumber 312
dc.contributor.author Malekipirbazari, Milad
dc.contributor.author Çavus, Özlem
dc.date.accessioned 2024-03-12T11:21:42Z
dc.date.available 2024-03-12T11:21:42Z
dc.date.issued 2023-08-06
dc.department Department of Industrial Engineering
dc.description.abstract The multiarmed bandit problem (MAB) is a classic problem in which a finite amount of resources must be allocated among competing choices with the aim of identifying a policy that maximizes the expected total reward. MAB has a wide range of applications including clinical trials, portfolio design, tuning parameters, internet advertisement, auction mechanisms, adaptive routing in networks, and project management. The classical MAB makes the strong assumption that the decision maker is risk-neutral and indifferent to the variability of the outcome. However, in many real life applications, these assumptions are not met and decision makers are risk-averse. Motivated to resolve this, we study risk-averse control of the multiarmed bandit problem in regard to the concept of dynamic coherent risk measures to determine a policy with the best risk-adjusted total discounted return. In respect of this specific setting, we present a theoretical analysis based on Whittle’s retirement problem and propose a priority-index policy that reduces to the Gittins index when the level of risk-aversion converges to zero. We generalize the restart formulation of the Gittins index to effectively compute these risk-averse allocation indices. Numerical results exhibit the excellent performance of this heuristic approach for two well-known coherent risk measures of first-order mean-semideviation and mean-AVaR. Our experimental studies suggest that there is no guarantee that an index-based optimal policy exists for the risk-averse problem. Nonetheless, our risk-averse allocation indices can achieve optimal or near-optimal policies which in some instances are easier to interpret compared to the exact optimal policy.
dc.description.provenance Made available in DSpace on 2024-03-12T11:21:42Z (GMT). No. of bitstreams: 1 Index_policy_for_multiarmed_bandit_problem_with_dynamic_risk_measures.pdf: 1492573 bytes, checksum: 61206af78bba4648dc0684653a6d6331 (MD5) Previous issue date: 2023-08-06 en
dc.description.tableofcontents Stochastics and statistics
dc.embargo.release 2025-08-06
dc.identifier.doi 10.1016/j.ejor.2023.08.004
dc.identifier.eissn 1872-6860
dc.identifier.issn 0377-2217
dc.identifier.uri https://hdl.handle.net/11693/114590
dc.language.iso en
dc.publisher Elsevier BV
dc.relation.isversionof https://doi.org/10.1016/j.ejor.2023.08.004
dc.rights CC BY-NC-ND 4.0 DEED (Attribution-NonCommercial-NoDerivs 4.0 International)
dc.rights.uri https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.source.title European Journal of Operational Research
dc.subject Stochastic programming
dc.subject Multiarmed bandit problem
dc.subject Gittins index
dc.subject Dynamic coherent risk measures
dc.subject Risk-averse control
dc.title Index policy for multiarmed bandit problem with dynamic risk measures
dc.type Article

Files

Original bundle

Name: Index_policy_for_multiarmed_bandit_problem_with_dynamic_risk_measures.pdf
Size: 1.42 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 2.01 KB
Format: Item-specific license agreed upon to submission