Browsing by Subject "Contextual MAB"


Open Access
Contextual multi-armed bandits with structured payoffs
(2020-09) Qureshi, Muhammad Anjum
Multi-Armed Bandit (MAB) problems model sequential decision making under uncertainty. In traditional MAB, the learner selects an arm in each round and then observes a random reward drawn from the arm's unknown reward distribution. The goal is to maximize the cumulative reward by learning to select optimal arms as often as possible. In the contextual MAB, an extension of MAB, the learner observes a context (side-information) at the beginning of each round, selects an arm, and then observes a random reward whose distribution depends on both the arriving context and the chosen arm. Another MAB variant, called unimodal MAB, assumes that the expected reward exhibits a unimodal structure over the arms, and tries to locate the arm with the "peak" reward by learning the direction of increase of the expected reward. In this thesis, we consider an extension of unimodal MAB called contextual unimodal MAB, and demonstrate that it is a powerful tool for designing Artificial Intelligence (AI)-enabled radios by utilizing the special structure of the dependence of the reward on the contexts and arms of the wireless environment. While AI-enabled radios are expected to enhance the spectral efficiency of 5th generation (5G) millimeter wave (mmWave) networks by learning to optimize network resources, allocating resources over the mmWave band is extremely challenging due to rapidly-varying channel conditions. We consider several resource allocation problems in this thesis under various design possibilities for mmWave radio networks under unknown channel statistics and without any channel state information (CSI) feedback: i) dynamic rate selection for an energy harvesting transmitter, ii) dynamic power allocation for heterogeneous applications, and iii) distributed resource allocation in a multi-user network.
All of these problems exhibit structured payoffs which are unimodal functions over partially ordered arms (transmission parameters) as well as unimodal or monotone functions over partially ordered contexts (side-information). Structure over arms helps in reducing the number of arms to be explored, while structure over contexts helps in using past information from nearby contexts to make better selections. We formalize dynamic adaptation of transmission parameters as a structured MAB, and propose frequentist and Bayesian online learning algorithms. We show that both approaches yield regret that is logarithmic in time. We also investigate dynamic rate and channel adaptation in a cognitive radio network serving heterogeneous applications under dynamically varying channel availability and rate constraints. We formalize the problem as a Bayesian learning problem, and propose a novel learning algorithm which considers each rate-channel pair as a two-dimensional action. The set of available actions varies dynamically over time due to variations in primary user activity and rate requirements of the applications served by the users. Additionally, we extend the work to the scenario in which the arms, as well as the contexts, belong to a continuous interval. Finally, we show via simulations that our algorithms significantly improve the performance in the aforementioned radio resource allocation problems.
Open Access
Fast learning for dynamic resource allocation in AI-Enabled radio networks
(IEEE, 2020) Qureshi, Muhammad Anjum; Tekin, Cem
Artificial Intelligence (AI)-enabled radios are expected to enhance the spectral efficiency of 5th generation (5G) millimeter wave (mmWave) networks by learning to optimize network resources. However, allocating resources over the mmWave band is extremely challenging due to rapidly-varying channel conditions. We consider several resource allocation problems for mmWave radio networks under unknown channel statistics and without any channel state information (CSI) feedback: i) dynamic rate selection for an energy harvesting transmitter, ii) dynamic power allocation for heterogeneous applications, and iii) distributed resource allocation in a multi-user network. All of these problems exhibit structured payoffs which are unimodal functions over partially ordered arms (transmission parameters) as well as over partially ordered contexts (side-information). Unimodality over arms helps in reducing the number of arms to be explored, while unimodality over contexts helps in using past information from nearby contexts to make better selections. We model this as a structured reinforcement learning problem, called contextual unimodal multi-armed bandit (MAB), and propose an online learning algorithm that exploits unimodality to optimize the resource allocation over time, and prove that it achieves logarithmic in time regret. Our algorithm's regret scales sublinearly both in the number of arms and contexts for a wide range of scenarios. We also show via simulations that our algorithm significantly improves the performance in the aforementioned resource allocation problems.
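As a rough illustration of how unimodality over arms can shrink the set of arms that need exploring, the sketch below restricts a standard UCB index to the neighbors of the current empirical leader. It is not the algorithm proposed in the paper: the line-graph neighborhood, the UCB1-style bonus, and the names `neighborhood` and `select_arm` are all assumptions made for this example, and contexts are ignored entirely.

```python
import math


def neighborhood(leader, n_arms):
    """Return the leader and its adjacent arms, assuming arms lie on a line."""
    return [a for a in (leader - 1, leader, leader + 1) if 0 <= a < n_arms]


def select_arm(means, counts, t):
    """Choose the next arm using a UCB index restricted to the leader's neighbors.

    means[a]  -- empirical mean reward of arm a so far
    counts[a] -- number of times arm a has been played
    t         -- current round number (1-based)
    """
    # Play every arm once so that each empirical mean is defined.
    for a, n in enumerate(counts):
        if n == 0:
            return a

    # The leader is the arm with the highest empirical mean.
    leader = max(range(len(means)), key=lambda a: means[a])

    def ucb(a):
        # UCB1-style index (an assumption of this sketch).
        return means[a] + math.sqrt(2.0 * math.log(t) / counts[a])

    # Under unimodality, a better arm (if any) is adjacent to the leader,
    # so only the leader's neighborhood is considered.
    return max(neighborhood(leader, len(means)), key=ucb)
```

In this sketch, an arm far from the leader is never chosen after its initial pull, however few samples it has, which is the sense in which structure over arms reduces exploration.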
Open Access
Multi-objective contextual multi-armed bandit with a dominant objective
(IEEE, 2018) Tekin, Cem; Turgay, Eralp
We propose a new multi-objective contextual multi-armed bandit (MAB) problem with two objectives, where one of the objectives dominates the other objective. In the proposed problem, the learner obtains a random reward vector, where each component of the reward vector corresponds to one of the objectives and the distribution of the reward depends on the context that is provided to the learner at the beginning of each round. We call this problem contextual multi-armed bandit with a dominant objective (CMAB-DO). In CMAB-DO, the goal of the learner is to maximize its total reward in the non-dominant objective while ensuring that it maximizes its total reward in the dominant objective. In this case, the optimal arm given the context is the one that maximizes the expected reward in the non-dominant objective among all arms that maximize the expected reward in the dominant objective. First, we show that the optimal arm lies in the Pareto front. Then, we propose the multi-objective contextual multi-armed bandit algorithm (MOC-MAB), and define two performance measures: the 2-dimensional (2D) regret and the Pareto regret. We show that both the 2D regret and the Pareto regret of MOC-MAB are sublinear in the number of rounds. We also compare the performance of the proposed algorithm with other state-of-the-art methods on synthetic and real-world datasets. The proposed model and the algorithm have a wide range of real-world applications that involve multiple and possibly conflicting objectives, ranging from wireless communication to medical diagnosis and recommender systems.
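The definition of the optimal arm in CMAB-DO can be made concrete with a small sketch. It assumes the expected rewards of every arm for a fixed context are already known, which the learner in the paper does not have, so this shows only the selection rule and not MOC-MAB itself. The function name `optimal_arm` and the tolerance `tol` are hypothetical choices for this example.

```python
def optimal_arm(dominant, non_dominant, tol=1e-9):
    """Return the index of the lexicographically optimal arm for one context.

    dominant[a]     -- expected reward of arm a in the dominant objective
    non_dominant[a] -- expected reward of arm a in the non-dominant objective
    tol             -- numerical tolerance for ties in the dominant objective
    """
    # Step 1: keep only the arms that maximize the dominant objective.
    best = max(dominant)
    candidates = [a for a, mu in enumerate(dominant) if best - mu <= tol]

    # Step 2: among those, pick the arm with the highest non-dominant reward.
    return max(candidates, key=lambda a: non_dominant[a])


# Arms 0 and 1 tie in the dominant objective; arm 1 wins on the second one.
# Arm 2 has the highest non-dominant reward but is excluded in step 1.
print(optimal_arm([0.8, 0.8, 0.5], [0.2, 0.6, 0.9]))  # prints 1
```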