Browsing by Subject "Multi-armed bandit"
Now showing 1 - 6 of 6
Item Open Access
Contextual combinatorial volatile multi-armed bandits in compact context spaces (2021-07)
Nika, Andi
We consider the contextual combinatorial volatile multi-armed bandit (CCV-MAB) problem in compact context spaces, simultaneously taking into consideration all of its individual features, thus providing a general framework for solving a wide range of practical problems. We solve CCV-MAB using two approaches. First, we use the so-called adaptive discretization technique, which sequentially partitions the context space X into 'regions of similarity' and stores similar statistics corresponding to such regions. Under monotonicity of the expected reward and mild continuity assumptions, for both the expected reward and the expected base arm outcomes, we propose Adaptive Contextual Combinatorial Upper Confidence Bound (ACC-UCB), an online learning algorithm that uses adaptive discretization and incurs Õ(T^((D̄+1)/(D̄+2)+ε)) regret for any ε > 0, where D̄ represents the approximate optimality dimension related to X. This dimension captures both the benignness of the base arm arrivals and the structure of the expected reward. Second, we impose a Gaussian process (GP) structure on the expected base arm outcomes and thus, using the smoothness of the GP posterior, eliminate the need for adaptive discretization. We propose Optimistic Combinatorial Learning and Optimization with Kernel Upper Confidence Bounds (O'CLOK-UCB), which incurs Õ(K√(T γ̄_T)) regret, where γ̄_T is the maximum information gain associated with the set of base arm contexts that appeared in the first T rounds and K is the maximum cardinality of any feasible super arm over all rounds. For both methods, we provide experimental results which demonstrate the superiority of ACC-UCB over the previous state of the art and of O'CLOK-UCB over ACC-UCB.

Item Open Access
Diabetes management via Gaussian process bandits (2021-10)
Çelik, Ahmet Alparslan
Management of chronic diseases such as diabetes mellitus requires adaptation of treatment regimes based on patient characteristics and response. There is no single treatment that fits all patients in all contexts; moreover, the set of admissible treatments usually varies over the course of the disease. In this thesis, we address the problem of optimizing treatment regimes under time-varying constraints by using volatile contextual Gaussian process bandits. In particular, we propose a variant of GP-UCB with volatile arms, which takes into account the patient's context together with the set of admissible treatments when recommending new treatments. Our Bayesian approach is able to provide treatment recommendations to the patients along with confidence scores that can be used for risk assessment. We use our algorithm to recommend bolus insulin doses for type 1 diabetes mellitus patients. We test our algorithm on in-silico subjects that come with an open-source implementation of the FDA-approved UVa/Padova type 1 diabetes mellitus simulator. We also compare its performance against a clinician. Moreover, we present a pilot study with a few clinicians and patients, in which we design interfaces through which they can interact with the model. Meanwhile, we address issues regarding privacy, safety, and ethics. Simulation studies show that our algorithm compares favorably with traditional blood glucose regulation methods.
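As a rough illustration of the GP-UCB-with-volatile-arms idea described in the item above, the sketch below scores each currently admissible treatment for a given patient context by an upper confidence bound computed from a Gaussian process posterior. This is a minimal sketch, not the thesis implementation: the function names (recommend_dose, update), the confidence parameter beta_t, and the use of scikit-learn's GaussianProcessRegressor are assumptions made for illustration.

```python
# Illustrative GP-UCB step with a volatile (time-varying) set of admissible doses.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-2)
history_x, history_y = [], []   # (context, dose) inputs and observed outcomes

def recommend_dose(context, admissible_doses, beta_t=2.0):
    """Pick the admissible dose with the highest GP-UCB score for this context."""
    if not history_x:                        # no data yet: any admissible dose
        return admissible_doses[0]
    candidates = np.array([np.append(context, d) for d in admissible_doses])
    mean, std = gp.predict(candidates, return_std=True)
    ucb = mean + np.sqrt(beta_t) * std       # optimism in the face of uncertainty
    return admissible_doses[int(np.argmax(ucb))]

def update(context, dose, outcome):
    """Refit the GP posterior after observing the outcome of the chosen dose."""
    history_x.append(np.append(context, dose))
    history_y.append(outcome)
    gp.fit(np.array(history_x), np.array(history_y))
```

The same posterior mean and standard deviation also give the confidence scores mentioned in the abstract, since the predictive uncertainty of each recommendation is available alongside it.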
Item Open Access
Feedback adaptive learning for medical and educational application recommendation (IEEE, 2020)
Tekin, Cem; Elahi, Sepehr; Van Der Schaar, M.
Recommending applications (apps) to improve health or educational outcomes requires long-term planning and adaptation based on user feedback, as it is imperative to recommend the right app at the right time to improve engagement and benefit. We model the challenging task of app recommendation for these specific categories of apps, and similar ones, using a new reinforcement learning method referred to as episodic multi-armed bandit (eMAB). In eMAB, the learner recommends apps to individual users and observes their interactions with the recommendations on a weekly basis. It then uses this data to maximize the total payoff of all users by learning to recommend specific apps. Since computing the optimal recommendation sequence is intractable, as a benchmark, we define an oracle that sequentially recommends apps to maximize the expected immediate gain. Then, we propose our online learning algorithm, named FeedBack Adaptive Learning (FeedBAL), and prove that its regret with respect to the benchmark increases logarithmically in expectation. We demonstrate the effectiveness of FeedBAL on recommending mental health apps based on data from an app suite and show that it results in a substantial increase in the number of app sessions compared with episodic versions of ε_n-greedy, Thompson sampling, and collaborative filtering methods.

Item Open Access
Generalized global bandit and its application in cellular coverage optimization (Institute of Electrical and Electronics Engineers, 2018)
Shen, C.; Zhou, R.; Tekin, Cem; Schaar, M. V. D.
Motivated by the engineering problem of cellular coverage optimization, we propose a novel multiarmed bandit model called generalized global bandit. We develop a series of greedy algorithms that have the capability to handle nonmonotonic but decomposable reward functions, multidimensional global parameters, and switching costs. The proposed algorithms are rigorously analyzed under the multiarmed bandit framework, where we show that they achieve bounded regret, and hence, they are guaranteed to converge to the optimal arm in finite time. The algorithms are then applied to the cellular coverage optimization problem to achieve the optimal tradeoff between sufficient small cell coverage and limited macro leakage without prior knowledge of the deployment environment. The performance advantage of the new algorithms over existing bandit solutions is revealed analytically and further confirmed via numerical simulations. The key element behind the performance improvement is a more efficient 'trial and error' mechanism, in which any trial will help improve the knowledge of all candidate power levels.
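The defining feature of the generalized global bandit model, namely that every arm's expected reward is a known function of one shared unknown global parameter, is easy to illustrate. The toy sketch below pools the feedback from all played arms into a grid-based least-squares estimate of that parameter and then greedily plays the arm that looks best under the estimate. The specific reward functions, the grid estimator, and the Gaussian noise are illustrative assumptions rather than the authors' exact algorithm.

```python
# Toy sketch: arms (e.g., candidate power levels) share one unknown parameter theta,
# so feedback from any arm informs the estimate used to rank all arms.
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.7                                    # unknown global parameter
reward_funcs = [lambda th: 1 - (th - 0.2) ** 2,     # known reward function of arm 0
                lambda th: 1 - (th - 0.5) ** 2,     # arm 1
                lambda th: 1 - (th - 0.9) ** 2]     # arm 2
grid = np.linspace(0.0, 1.0, 201)                   # candidate theta values

samples = []                                        # (arm, observed reward) pairs
for t in range(200):
    if not samples:
        theta_hat = rng.choice(grid)                # no data yet: arbitrary guess
    else:
        # Least-squares estimate of theta over the grid, pooling all arms' feedback.
        errs = [sum((r - reward_funcs[a](th)) ** 2 for a, r in samples) for th in grid]
        theta_hat = grid[int(np.argmin(errs))]
    arm = int(np.argmax([f(theta_hat) for f in reward_funcs]))    # greedy choice
    reward = reward_funcs[arm](theta_true) + rng.normal(0, 0.05)  # noisy feedback
    samples.append((arm, reward))
```

Because every observation sharpens the estimate of the shared parameter, each trial improves the knowledge of all candidate arms, which is the more efficient 'trial and error' mechanism referred to in the abstract.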
Item Open Access
Multi-objective multi-armed bandit with lexicographically ordered and satisficing objectives (Springer, 2021-06)
Hüyük, A.; Tekin, Cem
We consider the multi-objective multi-armed bandit problem with (i) lexicographically ordered and (ii) satisficing objectives. In the first problem, the goal is to select arms that are lexicographic optimal as much as possible without knowing the arm reward distributions beforehand. We capture this goal by defining a multi-dimensional form of regret that measures the loss due to not selecting lexicographic optimal arms, and then propose an algorithm that achieves Õ(T^(2/3)) gap-free regret and prove a regret lower bound of Ω(T^(2/3)). We also consider two additional settings in which the learner has prior information on the expected arm rewards. In the first setting, the learner only knows for each objective the lexicographic optimal expected reward. In the second setting, it only knows for each objective a near-lexicographic optimal expected reward. For both settings, we prove that the learner achieves expected regret uniformly bounded in time. Then, we show that the algorithm we propose for the second setting of lexicographically ordered objectives with prior information also attains bounded regret for satisficing objectives. Finally, we experimentally evaluate the proposed algorithms in a variety of multi-objective learning problems.

Item Open Access
Risk-averse multi-armed bandit problem (2021-08)
Malekipirbazari, Milad
In the classical multi-armed bandit problem, the aim is to find a policy maximizing the expected total reward, implicitly assuming that the decision maker is risk-neutral. On the other hand, decision makers are risk-averse in some real-life applications. In this study, we design a new setting for the classical multi-armed bandit problem (MAB) based on the concept of dynamic risk measures, where the aim is to find a policy with the best risk-adjusted total discounted outcome. We provide a theoretical analysis of MAB with respect to this novel setting, and propose two different priority-index heuristics giving risk-averse allocation indices with structures similar to the Gittins index. The first proposed heuristic is based on Lagrangian duality, and each index is expressed as the Lagrangian multiplier corresponding to the activation constraint. In the second part, we present a theoretical analysis based on Whittle's retirement problem and propose a generalized version of the restart-in-state formulation of the Gittins index to compute the proposed risk-averse allocation indices. Finally, as a practical application of the proposed methods, we focus on the optimal design of clinical trials, and we apply our risk-averse MAB approach to perform risk-averse treatment allocation based on a Bayesian Bernoulli model. We evaluate the performance of our approach against other allocation rules, including fixed randomization.
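To make the clinical-trial application concrete, the toy sketch below allocates patients between two treatments under a Bayesian Bernoulli model, using a posterior mean penalized by the posterior standard deviation as the allocation index. This mean-deviation index is a simple stand-in chosen only for illustration; it is not the dynamic-risk-measure allocation index derived in the thesis, and the success probabilities and the risk_aversion parameter are made up.

```python
# Toy risk-averse allocation under a Beta-Bernoulli model for two treatments.
# The mean-minus-deviation index below is an illustrative stand-in, NOT the
# risk-averse Gittins-like index proposed in the thesis.
import numpy as np

rng = np.random.default_rng(1)
true_success = [0.55, 0.45]          # unknown success probabilities of two treatments
alpha = np.ones(2)                   # Beta(1, 1) priors on each success probability
beta = np.ones(2)
risk_aversion = 1.0                  # larger values penalize uncertain treatments more

for patient in range(100):
    mean = alpha / (alpha + beta)
    var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
    index = mean - risk_aversion * np.sqrt(var)    # risk-adjusted allocation index
    arm = int(np.argmax(index))                    # treat the next patient with this arm
    outcome = rng.random() < true_success[arm]     # observe success or failure
    alpha[arm] += outcome                          # Bayesian posterior update
    beta[arm] += 1 - outcome
```

Replacing the index line with an expected-reward-only rule recovers the risk-neutral greedy allocation, which is the contrast the thesis draws when comparing against other allocation rules such as fixed randomization.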