Browsing by Subject "Regret bounds"
Now showing 1 - 9 of 9
Item Open Access
Contextual multi-armed bandits with structured payoffs (Bilkent University, 2020-09) Qureshi, Muhammad Anjum
Multi-Armed Bandit (MAB) problems model sequential decision making under uncertainty. In the traditional MAB, the learner selects an arm in each round and then observes a random reward drawn from the arm's unknown reward distribution. The goal is to maximize the cumulative reward by learning to select optimal arms as often as possible. In the contextual MAB, an extension of the MAB, the learner observes a context (side-information) at the beginning of each round, selects an arm, and then observes a random reward whose distribution depends on both the arriving context and the chosen arm. Another MAB variant, called the unimodal MAB, assumes that the expected reward exhibits a unimodal structure over the arms, and tries to locate the arm with the "peak" reward by learning the direction of increase of the expected reward. In this thesis, we consider an extension of the unimodal MAB called the contextual unimodal MAB, and demonstrate that it is a powerful tool for designing Artificial Intelligence (AI)-enabled radios by exploiting the special structure of the dependence of the reward on the contexts and arms of the wireless environment. While AI-enabled radios are expected to enhance the spectral efficiency of 5th generation (5G) millimeter wave (mmWave) networks by learning to optimize network resources, allocating resources over the mmWave band is extremely challenging due to rapidly varying channel conditions. We consider several resource allocation problems in this thesis, under various design possibilities for mmWave radio networks, under unknown channel statistics and without any channel state information (CSI) feedback: i) dynamic rate selection for an energy harvesting transmitter, ii) dynamic power allocation for heterogeneous applications, and iii) distributed resource allocation in a multi-user network.
All of these problems exhibit structured payoffs that are unimodal functions over partially ordered arms (transmission parameters) as well as unimodal or monotone functions over partially ordered contexts (side-information). Structure over arms helps reduce the number of arms to be explored, while structure over contexts helps use past information from nearby contexts to make better selections. We formalize the dynamic adaptation of transmission parameters as a structured MAB, and propose frequentist and Bayesian online learning algorithms. We show that both approaches yield regret that is logarithmic in time. We also investigate dynamic rate and channel adaptation in a cognitive radio network serving heterogeneous applications under dynamically varying channel availability and rate constraints. We formalize the problem as a Bayesian learning problem, and propose a novel learning algorithm that treats each rate-channel pair as a two-dimensional action. The set of available actions varies dynamically over time due to variations in primary user activity and in the rate requirements of the applications served by the users. Additionally, we extend the work to the scenario in which both the arms and the contexts belong to continuous intervals. Finally, we show via simulations that our algorithms significantly improve performance in the aforementioned radio resource allocation problems.

Item Open Access
Decentralized dynamic rate and channel selection over a shared spectrum (IEEE, 2021-03-15) Javanmardi, Alireza; Qureshi, Muhammad Anjum; Tekin, Cem
We consider the problem of distributed dynamic rate and channel selection in a multi-user network, in which each user selects a wireless channel and a modulation and coding scheme (corresponding to a transmission rate) in order to maximize the network throughput. We assume that the users are cooperative; however, there is no coordination or communication among them, and the number of users in the system is unknown.
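(As an aside, the throughput model behind such multi-user shared-spectrum problems, in which a channel chosen by more than one user yields nothing, can be sketched in a few lines. This is an illustrative toy model, not code from these works; the per-user rates and the all-or-nothing collision rule are assumptions made here for exposition.)

```python
def simulate_round(channel_choices, rates):
    """One round of a toy shared-spectrum model: each user picks a
    channel; a channel used by exactly one user delivers that user's
    chosen rate, while any channel picked by two or more users
    collides and delivers zero throughput to all of them."""
    throughput = []
    for user, ch in enumerate(channel_choices):
        if channel_choices.count(ch) == 1:
            throughput.append(rates[user])
        else:
            throughput.append(0.0)  # collision: channel is wasted
    return throughput
```

Under this rule, an orthogonal (collision-free) assignment of users to channels is what a decentralized learner must converge to.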
We formulate this problem as a multi-player multi-armed bandit problem and propose a decentralized learning algorithm that performs almost optimal exploration of the transmission rates in order to learn fast. We prove that the regret of our learning algorithm with respect to the optimal allocation grows logarithmically over rounds, with a leading term that is logarithmic in the number of transmission rates. Finally, we compare the performance of our learning algorithm with the state-of-the-art via simulations and show that it substantially improves throughput and minimizes the number of collisions.

Item Open Access
Exploiting relevance for online decision-making in high-dimensions (IEEE, 2020) Turgay, Eralp; Bulucu, Cem; Tekin, Cem
Many sequential decision-making tasks require choosing, at each decision step, the right action out of a vast set of possibilities by extracting actionable intelligence from high-dimensional data streams. Most of the time, the high dimensionality of actions and data makes learning the optimal actions by traditional learning methods impracticable. In this work, we investigate how to discover and leverage sparsity in actions and data to enable fast learning. As our learning model, we consider a structured contextual multi-armed bandit (CMAB) with high-dimensional arm (action) and context (data) sets, where the rewards depend only on a few relevant dimensions of the joint context-arm set, possibly in a non-linear way. We depart from the prior work by assuming a high-dimensional, continuum set of arms, and by allowing the relevant context dimensions to vary for each arm. We propose a new online learning algorithm called CMAB with Relevance Learning (CMAB-RL). CMAB-RL enjoys a substantially improved regret bound compared to classical CMAB algorithms, whose regrets depend on the numbers of dimensions d_x and d_a of the context and arm sets.
Importantly, we show that when the learner has prior knowledge of the sparsity, given in terms of upper bounds d̄_x and d̄_a on the number of relevant context and arm dimensions, CMAB-RL achieves Õ(T^{1 − 1/(2 + 2d̄_x + d̄_a)}) regret. Finally, we illustrate how CMAB algorithms can be used for optimal personalized blood glucose control in type 1 diabetes mellitus patients, and show that CMAB-RL outperforms other contextual MAB algorithms in this task.

Item Open Access
Fast learning for dynamic resource allocation in AI-Enabled radio networks (IEEE, 2020) Qureshi, Muhammad Anjum; Tekin, Cem
Artificial Intelligence (AI)-enabled radios are expected to enhance the spectral efficiency of 5th generation (5G) millimeter wave (mmWave) networks by learning to optimize network resources. However, allocating resources over the mmWave band is extremely challenging due to rapidly varying channel conditions. We consider several resource allocation problems for mmWave radio networks under unknown channel statistics and without any channel state information (CSI) feedback: i) dynamic rate selection for an energy harvesting transmitter, ii) dynamic power allocation for heterogeneous applications, and iii) distributed resource allocation in a multi-user network. All of these problems exhibit structured payoffs that are unimodal functions over partially ordered arms (transmission parameters) as well as over partially ordered contexts (side-information). Unimodality over arms helps reduce the number of arms to be explored, while unimodality over contexts helps use past information from nearby contexts to make better selections. We model this as a structured reinforcement learning problem, called the contextual unimodal multi-armed bandit (MAB), and propose an online learning algorithm that exploits unimodality to optimize the resource allocation over time, and prove that it achieves regret that is logarithmic in time.
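(For intuition, the way unimodality over ordered arms confines exploration to a local neighbourhood can be illustrated with a toy routine. This is an illustrative sketch, not the algorithm proposed in these works; the line of Bernoulli arms, the UCB1-style index, and the leader-neighbourhood rule are all assumptions made here for exposition.)

```python
import math
import random

def unimodal_ucb(means, horizon, seed=0):
    """Toy unimodal bandit: arms lie on a line and the expected reward
    rises to a single peak, then falls.  Each round, UCB exploration is
    restricted to the empirical leader and its two neighbours, mimicking
    how unimodal structure shrinks the set of arms that must be explored.
    Returns the index of the most-played arm."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k

    def pull(a):
        # Bernoulli reward with the arm's (unknown) mean.
        sums[a] += 1.0 if rng.random() < means[a] else 0.0
        counts[a] += 1

    for a in range(k):          # initialise every arm once
        pull(a)

    for t in range(k, horizon):
        leader = max(range(k), key=lambda a: sums[a] / counts[a])
        neighbourhood = [a for a in (leader - 1, leader, leader + 1) if 0 <= a < k]
        # UCB1 index, evaluated only over the local neighbourhood.
        index = lambda a: sums[a] / counts[a] + math.sqrt(2 * math.log(t + 1) / counts[a])
        pull(max(neighbourhood, key=index))

    return max(range(k), key=lambda a: counts[a])
```

Restricting the index comparison to the leader's neighbourhood is what lets regret scale with the local structure rather than with the total number of arms.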
Our algorithm's regret scales sublinearly in both the number of arms and the number of contexts for a wide range of scenarios. We also show via simulations that our algorithm significantly improves performance in the aforementioned resource allocation problems.

Item Open Access
Fully distributed bandit algorithm for the joint channel and rate selection problem in heterogeneous cognitive radio networks (Bilkent University, 2020-12) Javanmardi, Alireza
We consider the problem of distributed sequential channel and rate selection in cognitive radio networks, where multiple users choose channels from the same set of available wireless channels and pick modulation and coding schemes (corresponding to transmission rates). In order to maximize the network throughput, the users need to be cooperative, although communication among them is not allowed. Moreover, if multiple users select the same channel simultaneously, they collide, and none of them is able to use the channel for transmission. We rigorously formulate this resource allocation problem as a multi-player multi-armed bandit problem and propose a decentralized learning algorithm called Game of Thrones with Sequential Halving Orthogonal Exploration (GoT-SHOE). The proposed algorithm keeps the number of collisions in the network as low as possible and performs almost optimal exploration of the transmission rates to speed up the learning process. We prove that our learning algorithm achieves regret, with respect to the optimal allocation, that grows logarithmically over rounds with a leading term that is logarithmic in the number of transmission rates. We also propose an extension of our algorithm that works when the number of users is greater than the number of channels. Moreover, we discuss how Sequential Halving Orthogonal Exploration can be combined with any distributed channel assignment algorithm to enhance its performance.
Finally, we provide extensive simulations and compare the performance of our learning algorithm with the state-of-the-art, demonstrating the superiority of the proposed algorithm in terms of higher system throughput and a lower number of collisions.

Item Open Access
Multi-objective contextual bandits with a dominant objective (IEEE, 2017) Tekin, Cem; Turgay, Eralp
In this paper, we propose a new contextual bandit problem with two objectives, where one of the objectives dominates the other. Unlike single-objective bandit problems, in which the learner obtains a random scalar reward for each arm it selects, in the proposed problem the learner obtains a random reward vector, where each component of the reward vector corresponds to one of the objectives. The goal of the learner is to maximize its total reward in the non-dominant objective while ensuring that it maximizes its reward in the dominant objective. In this case, the optimal arm given a context is the one that maximizes the expected reward in the non-dominant objective among all arms that maximize the expected reward in the dominant objective. For this problem, we propose the multi-objective contextual multi-armed bandit algorithm (MOC-MAB), and prove that it achieves sublinear regret with respect to the optimal context-dependent policy. Then, we compare the performance of the proposed algorithm with other state-of-the-art bandit algorithms. The proposed contextual bandit model and algorithm have a wide range of real-world applications that involve multiple and possibly conflicting objectives, ranging from wireless communication to medical diagnosis and recommender systems.

Item Open Access
Online Contextual Influence Maximization in social networks (Institute of Electrical and Electronics Engineers Inc., 2017) Sarıtaç, Ömer; Karakurt, Altuğ; Tekin, Cem
In this paper, we propose the Online Contextual Influence Maximization Problem (OCIMP).
In OCIMP, the learner faces a series of epochs, in each of which a different influence campaign is run to promote a certain product in a given social network. In each epoch, the learner first distributes a limited number of free samples of the product among a set of seed nodes in the social network. Then, the influence spread process takes place over the network, through which other users get influenced and purchase the product. The goal of the learner is to maximize the expected total number of influenced users over all epochs. We depart from the prior work in two aspects: (i) the learner does not know how the influence spreads over the network, i.e., it is unaware of the influence probabilities; (ii) the influence probabilities depend on the context. We develop a learning algorithm for OCIMP, called Contextual Online INfluence maximization (COIN). COIN can use any approximation algorithm that solves the offline influence maximization problem as a subroutine to obtain the set of seed nodes in each epoch. When the influence probabilities are Hölder continuous functions of the context, we prove that COIN achieves sublinear regret with respect to an approximation oracle that knows the influence probabilities for all contexts. Moreover, our regret bound holds for any sequence of contexts. We also test the performance of COIN on several social networks, and show that it performs better than other methods. © 2016 IEEE.

Item Open Access
Online contextual influence maximization with costly observations (IEEE, 2019-06) Sarıtaç, Anıl Ömer; Karakurt, Altuğ; Tekin, Cem
In the online contextual influence maximization problem with costly observations, the learner faces a series of epochs, in each of which a different influence spread process takes place over a network. At the beginning of each epoch, the learner exogenously influences (activates) a set of seed nodes in the network. Then, the influence spread process takes place over the network, through which other nodes get influenced.
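(For intuition, spread processes of this kind are commonly modelled as an independent cascade. The minimal simulation below, on a hypothetical directed graph with a single shared edge probability, illustrates only the diffusion model, not the learning algorithms of these papers.)

```python
import random

def independent_cascade(edges, seeds, p, seed=0):
    """Simulate one independent-cascade spread: each newly activated
    node gets a single chance to activate each of its inactive
    out-neighbours, succeeding independently with probability p.
    `edges` maps a node to its list of out-neighbours.  Returns the
    set of all influenced nodes (seeds included)."""
    rng = random.Random(seed)
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        newly_active = []
        for u in frontier:
            for v in edges.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    newly_active.append(v)
        frontier = newly_active  # only fresh activations spread next
    return active
```

The influence maximization problem is then to choose the seed set that maximizes the expected size of the returned set, which is NP-hard in general and is why these papers rely on approximation oracles.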
The learner has the option to observe the spread of influence by paying an observation cost. The goal of the learner is to maximize its cumulative reward, which is defined as the expected total number of influenced nodes over all epochs minus the observation costs. We depart from the prior work in three aspects: 1) the learner does not know how the influence spreads over the network, i.e., it is unaware of the influence probabilities; 2) the influence probabilities depend on the context; and 3) observing influence is costly. We consider two different influence observation settings: costly edge-level feedback, in which the learner freely observes the set of influenced nodes but pays to observe the influence outcomes on the edges of the network; and costly node-level feedback, in which the learner pays to observe whether a node is influenced or not. Since the offline influence maximization problem itself is NP-hard, for these settings we develop online learning algorithms that use an approximation algorithm as a subroutine to obtain the set of seed nodes in each epoch. When the influence probabilities are Hölder continuous functions of the context, we prove that these algorithms achieve sublinear regret (for any sequence of contexts) with respect to an approximation oracle that knows the influence probabilities for all contexts. Our numerical results on several networks illustrate that the proposed algorithms perform on par with the state-of-the-art methods even when the observations are cost-free.

Item Open Access
Thompson sampling for combinatorial network optimization in unknown environments (IEEE, 2020) Hüyük, Alihan; Tekin, Cem
Influence maximization, adaptive routing, and dynamic spectrum allocation all require choosing the right action from a large set of alternatives. Thanks to advances in combinatorial optimization, these and many similar problems can be efficiently solved given an environment with known stochasticity.
In this paper, we take this one step further and focus on combinatorial optimization in unknown environments. We consider a very general learning framework called the combinatorial multi-armed bandit with probabilistically triggered arms, and a very powerful Bayesian algorithm called Combinatorial Thompson Sampling (CTS). Under the semi-bandit feedback model, and assuming access to an oracle without knowing the expected base arm outcomes beforehand, we show that when the expected reward is Lipschitz continuous in the expected base arm outcomes, CTS achieves O(∑_{i=1}^{m} log T / (p_i Δ_i)) regret and O(max{E[√(mT log T / p*)], E[m²/p*]}) Bayesian regret, where m denotes the number of base arms, p_i and Δ_i denote the minimum non-zero triggering probability and the minimum suboptimality gap of base arm i, respectively, T denotes the time horizon, and p* denotes the overall minimum non-zero triggering probability. We also show that when the expected reward satisfies triggering-probability-modulated Lipschitz continuity, CTS achieves O(max{√(mT log T), m²}) Bayesian regret, and that when the triggering probabilities are non-zero for all base arms, CTS achieves O((1/p*) log(1/p*)) regret, independent of the time horizon. Finally, we numerically compare CTS with algorithms based on upper confidence bounds in several networking problems, and show that CTS outperforms these algorithms by at least an order of magnitude in the majority of cases.
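(As a closing illustration, the Bayesian sampling principle behind CTS can be seen in its simplest single-play, Beta-Bernoulli form. This is a generic textbook sketch, not the combinatorial semi-bandit CTS analysed in the paper; the arm means and horizon below are arbitrary assumptions.)

```python
import random

def thompson_sampling(means, horizon, seed=0):
    """Beta-Bernoulli Thompson sampling: keep a Beta(a, b) posterior
    per arm, draw one sample from each posterior, and play the arm
    whose sample is largest.  Returns the per-arm pull counts."""
    rng = random.Random(seed)
    k = len(means)
    a = [1] * k  # successes + 1 (uniform Beta(1, 1) prior)
    b = [1] * k  # failures + 1
    counts = [0] * k
    for _ in range(horizon):
        samples = [rng.betavariate(a[i], b[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        reward = 1 if rng.random() < means[arm] else 0  # Bernoulli feedback
        a[arm] += reward
        b[arm] += 1 - reward
        counts[arm] += 1
    return counts
```

CTS extends this idea to combinatorial actions: posterior samples are drawn for all base arms, and the oracle picks the super-arm that is optimal for the sampled outcomes.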