Browsing by Subject "Online Learning"
Item Open Access Algorithms and regret bounds for multi-objective contextual bandits with similarity information (Bilkent University, 2019-01) Turğay, Eralp
Contextual bandit algorithms have been shown to be effective in solving sequential decision making problems under uncertain environments, ranging from cognitive radio networks to recommender systems to medical diagnosis. Many of these real world applications involve multiple and possibly conflicting objectives. In this thesis, we consider an extension of contextual bandits called multi-objective contextual bandits with similarity information. Unlike single-objective contextual bandits, in which the learner obtains a random scalar reward for each arm it selects, in the multi-objective contextual bandits, the learner obtains a random reward vector, where each component of the reward vector corresponds to one of the objectives and the distribution of the reward depends on the context that is provided to the learner at the beginning of each round. For this setting, first, we propose a new multi-objective contextual multi-armed bandit problem with similarity information that has two objectives, where one of the objectives dominates the other objective. Here, the goal of the learner is to maximize its total reward in the non-dominant objective while ensuring that it maximizes its total reward in the dominant objective. Then, we propose the multi-objective contextual multi-armed bandit algorithm (MOC-MAB), and define two performance measures: the 2-dimensional (2D) regret and the Pareto regret. We show that both the 2D regret and the Pareto regret of MOC-MAB are sublinear in the number of rounds. We also evaluate the performance of MOC-MAB in synthetic and real-world datasets. In the next problem, we consider a multi-objective contextual bandit problem with an arbitrary number of objectives and a high-dimensional, possibly uncountable arm set, which is endowed with the similarity information.
We propose an online learning algorithm called Pareto Contextual Zooming (PCZ), and prove that it achieves a Pareto regret that is sublinear in the number of rounds, which is near-optimal.
Item Open Access An asymptotically optimal solution for contextual bandit problem in adversarial setting (Bilkent University, 2018-05) Mohaghegh Neyshabouri, Mohammadreza
We propose online algorithms for sequential learning in the contextual multi-armed bandit setting. Our approach is to partition the context space and then optimally combine all of the possible mappings between the partition regions and the set of bandit arms in a data driven manner. We show that in our approach, the best mapping is able to approximate the best arm selection policy to any desired degree under mild Lipschitz conditions. Therefore, we design our algorithms based on the optimal adaptive combination and asymptotically achieve the performance of the best mapping as well as the best arm selection policy. This optimality is also guaranteed to hold even in adversarial environments since we do not rely on any statistical assumptions regarding the contexts or the loss of the bandit arms. Moreover, we design efficient implementations for our algorithms in various hierarchical partitioning structures such as lexicographical or arbitrary position splitting and binary trees (and several other partitioning examples). For instance, in the case of binary tree partitioning, the computational complexity is only log-linear in the number of regions in the finest partition. In conclusion, we provide significant performance improvements by introducing upper bounds (w.r.t. the best arm selection policy) that are mathematically proven to vanish in the average loss per round sense at a faster rate compared to the state-of-the-art. Our experimental work extensively covers various scenarios ranging from bandit settings to multi-class classification with real and synthetic data.
In these experiments, we show that our algorithms significantly outperform the state-of-the-art techniques while maintaining the introduced mathematical guarantees and reasonable computational scalability.
Item Open Access Low complexity efficient online learning algorithms using LSTM networks (Bilkent University, 2018-12) Mirza, Ali Hassan
In this thesis, we implement efficient online learning algorithms using the Long Short Term Memory (LSTM) networks with low time and computational complexity. In Chapter 2, we investigate efficient covariance information-based online learning using the LSTM networks known as Co-LSTM networks. We incorporate the covariance information into the LSTM gating structure and propose various efficient models. We reduce the computational complexity by applying the Weight Matrix Factorization (WMF) trick and derive the additive gradient based updates. In Chapter 3, we give a practical application of the network intrusion detection using the Co-LSTM networks. In Chapter 4, we propose a boosted binary version of Tree-LSTM networks which we call BBT-LSTM networks. We introduce the depth and windowing factor into the N-ary Tree-LSTM networks where each LSTM node is binarily split and the whole tree architecture grows in a balanced manner. In order to reduce the computational complexity of the BBT-LSTM networks, we apply the WMF trick, replace the regular multiplication operator with the energy efficient operator and finally introduce the slicing operation on the BBT-LSTM network weight matrices. In Chapter 5, we propose another low complexity LSTM network based on a minimum number of hopping over the input data sequence. We study two methods to select the appropriate value of the hopping distance.
Through an extensive set of experiments on real-life data sets, we demonstrate a significant increase in the performance of the proposed algorithms at the end of each chapter.
Item Open Access Online learning in structured Markov decision processes (Bilkent University, 2017-07) Akbarzadeh, Nima
This thesis proposes three new multi-armed bandit problems, in which the learner proceeds in a sequence of rounds where each round is a Markov Decision Process (MDP). The learner's goal is to maximize its cumulative reward without any a priori knowledge on the state transition probabilities. The first problem considers an MDP with sorted states, a continuation action that moves the learner to an adjacent state, and a terminal action that moves the learner to a terminal state (goal or dead-end state). In this problem, a round ends and the next round starts when a terminal state is reached, and the aim of the learner in each round is to reach the goal state. First, the structure of the optimal policy is derived. Then, the regret of the learner with respect to an oracle that takes optimal actions in each round is defined, and a learning algorithm that exploits the structure of the optimal policy is proposed. Finally, it is shown that the regret either increases logarithmically over rounds or becomes bounded. In the second problem, we investigate the personalization of a clinical treatment. This process is modeled as a goal-oriented MDP with dead-end states. Moreover, the state transition probabilities of the MDP depend on the context of the patients. An algorithm that uses the rule of optimism in the face of uncertainty is proposed to maximize the number of rounds in which the goal state is reached. In the third problem, we propose an online learning algorithm for optimal execution in the limit order book of a financial asset.
Given a certain number of shares to sell and an allocated time to complete the transaction, the proposed algorithm dynamically learns the optimal number of shares to sell at each time slot of the allocated time. We model this problem as an MDP, and derive the form of the optimal policy.
Item Open Access Online learning under adverse settings (Bilkent University, 2015-05) Özkan, Hüseyin
We present novel solutions for contemporary real life applications that generate data at unforeseen rates in unpredictable forms including non-stationarity, corruptions, missing/mixed attributes and high dimensionality. In particular, we introduce novel algorithms for online learning, where the observations are received sequentially and processed only once without being stored, under adverse settings: i) no or limited assumptions can be made about the data source, ii) the observations can be corrupted and iii) the data is to be processed at extremely fast rates. The introduced algorithms are highly effective and efficient with strong mathematical guarantees; and are shown, through the presented comprehensive real life experiments, to significantly outperform the competitors under such adverse conditions. We develop a novel highly dynamical ensemble method without any stochastic assumptions on the data source. The presented method is asymptotically guaranteed to perform as well as, i.e., competitive against, the best expert in the ensemble, where the competitor, i.e., the best expert, itself is also specifically designed to continuously improve over time in a completely data adaptive manner. In addition, our algorithm achieves a significantly superior modeling power (hence, a significantly superior prediction performance) through a hierarchical and self-organizing approach while mitigating overtraining issues by combining (taking finite unions of) low-complexity methods.
In contrast, the state-of-the-art ensemble techniques are heavily dependent on static and unstructured expert ensembles. In this regard, we rigorously solve the resulting issues such as the oversensitivity to source statistics as well as the incompatibility between the modeling power and the computational load/precision. Our results uniformly hold for every possible input stream in the deterministic sense regardless of the stationary or non-stationary source statistics. Furthermore, we directly address the data corruptions by developing novel versatile imputation methods and thoroughly demonstrate that anomaly detection, in addition to being an important stand-alone learning problem, is extremely effective for corruption detection/imputation purposes. To that end, for the first time in the literature, we develop the online implementation of the Neyman-Pearson characterization for anomalies in stationary or non-stationary fast streaming temporal data. The introduced anomaly detection algorithm maximizes the detection power at a specified controllable constant false alarm rate with no parameter tuning in a truly online manner. Our algorithms can process any streaming data at extremely fast rates without requiring a training phase or a priori information while bearing strong performance guarantees. Through extensive experiments over real/synthetic benchmark data sets, we also show that our algorithms significantly outperform the state-of-the-art as well as the most recently proposed techniques in the literature with remarkable adaptation capabilities to non-stationarity.
Item Open Access Online learning with recurrent neural networks (Bilkent University, 2018-07) Ergen, Tolga
In this thesis, we study online learning with Recurrent Neural Networks (RNNs). Particularly, in Chapter 2, we investigate online nonlinear regression and introduce novel regression structures based on the Long Short Term Memory (LSTM) network, which is an advanced RNN architecture.
To train these novel LSTM based structures, we introduce highly efficient and effective Particle Filtering (PF) based updates. We also provide Stochastic Gradient Descent (SGD) and Extended Kalman Filter (EKF) based updates. Our PF based training method guarantees convergence to the optimal parameter estimation in the Mean Square Error (MSE) sense. In Chapter 3, we investigate online training of LSTM architectures in a distributed network of nodes, where each node employs an LSTM based structure for online regression. We first provide a generic LSTM based regression structure for each node. In order to train this structure, we introduce a highly effective and efficient Distributed PF (DPF) based training algorithm. We also introduce a Distributed EKF (DEKF) based training algorithm. Here, our DPF based training algorithm guarantees convergence to the performance of the optimal centralized LSTM parameters in the MSE sense. In Chapter 4, we investigate variable length data regression in an online setting and introduce an energy efficient regression structure built on LSTM networks. To reduce the complexity of this structure, we first replace the regular multiplication operations with an energy efficient operator. We then apply factorizations to the weight matrices so that the total number of parameters to be trained is significantly reduced. We then introduce online training algorithms. Through a set of experiments, we illustrate significant performance gains and complexity reductions achieved by the introduced algorithms with respect to the state-of-the-art methods.
Item Open Access Online minimax optimal density estimation and anomaly detection in nonstationary environments (Bilkent University, 2017-07) Gökcesu, Kaan
Online anomaly detection has attracted significant attention in recent years due to its applications in network monitoring, cybersecurity, surveillance and sensor failure.
To this end, we introduce an algorithm that sequentially processes data to detect anomalies in time series. Our algorithm consists of two stages: density estimation and anomaly detection. First, we construct a probability density function to model the normal data. Then, we threshold the density of the newly observed data to detect anomalies. We approach this problem from an information theoretic perspective and, for the first time in the literature, propose minimax optimal schemes for both stages to create an optimal anomaly detection algorithm in a strong deterministic sense. For the first stage, we introduce an online density estimator that is minimax optimal for the general nonstationary exponential family of distributions without any assumptions on the observation sequence. Our algorithm does not require a priori knowledge of the time horizon, the drift of the underlying distribution or the time instances at which the parameters of the source change. Our results are guaranteed to hold in an individual sequence manner. For the second stage, we propose an online threshold selection scheme that has logarithmic performance bounds against the best threshold chosen in hindsight. Our complete algorithm adaptively updates its parameters in a truly sequential manner to achieve log-linear regrets in both stages. Because of its universal prediction perspective on its density estimation, our anomaly detection algorithm can be used in an unsupervised, semi-supervised or supervised manner. Through synthetic and real life experiments, we demonstrate substantial performance gains with respect to the state-of-the-art.
Item Open Access Personalizing treatments via contextual multi-armed bandits by identifying relevance (Bilkent University, 2019-08) Bulucu, Cem
Personalized medicine offers specialized treatment options for individuals, which is vital as every patient is different.
One-size-fits-all approaches are often not effective and most patients require personalized care when dealing with various diseases like cancer, heart diseases or diabetes. As vast amounts of data have become available in medicine (and other fields including web-based recommender systems and intelligent radio networks), online learning approaches are gaining popularity due to their ability to learn fast in uncertain environments. Contextual multi-armed bandit algorithms provide reliable sequential decision-making options in such applications. In medical settings (also in other aforementioned settings), data (contexts) and actions (arms) are often high-dimensional and the performance of traditional contextual multi-armed bandit approaches is almost as bad as random selection, due to the curse of dimensionality. Fortunately, in many cases the information relevant to the decision-making task does not depend on all dimensions but rather depends on a small subset of dimensions, called the relevant dimensions. In this thesis, we aim to provide personalized treatments for patients sequentially arriving over time by using contextual multi-armed bandit approaches when the expected rewards related to patient outcomes only vary on a small subset of context and arm dimensions. For this purpose, first, we make use of the contextual multi-armed bandit with relevance learning (CMAB-RL) algorithm which learns the relevance by employing a novel partitioning strategy on the context-arm space and forming a set of candidate relevant dimension tuples. In this model, the set of relevant patient traits is allowed to be different for different bolus insulin dosages. Next, we consider an environment where the expected reward function defined over the context-arm space is sampled from a Gaussian process.
For this setting, we propose an extension to the contextual Gaussian process upper confidence bound (CGP-UCB) algorithm, called CGP-UCB with relevance learning (CGP-UCB-RL), that learns the relevance by integrating kernels that allow weights to be associated with each dimension and optimizing the negative log marginal likelihood. Then, we investigate the suitability of this approach in the blood glucose regulation problem. Aside from applying both algorithms to the bolus insulin administration problem, we also evaluate their performance in synthetically generated environments as benchmarks.
Item Open Access Prediction with expert advice: on the role of contexts, bandit feedback and risk-awareness (Bilkent University, 2018-12) Ekşioğlu, Kubilay
Along with the rapid growth in the size of data generated and collected over time, the need for developing online algorithms that can provide answers without any offline training has considerably increased. In this thesis, we consider the prediction with expert advice problem under the online learning framework. Specifically, we consider problems where experts have asymmetric information about the sample space. First, we propose an algorithm that selects a subset of the experts and makes predictions based on the advice of this subset. Then, we propose another algorithm that clusters samples in an online manner and makes predictions based on the history of observations and decisions within each cluster. Next, we consider the Safe Bandit, a variant of the Risk Aware Multi Armed Bandit, where the goal is to minimize the number of rounds in which a risky arm is chosen. Adopting mean-variance as the risk notion, we define an arm as risky if its mean-variance is higher than a given threshold. Using this, we define a new regret measure called Risk Violation Regret (RVR), which depends on the number of times risky arms are selected.
Then, we propose a learning algorithm called Exploration and Exploitation with Risk Thresholds (EXERT), and prove that it achieves O(1) RVR with high probability. Afterwards, we use EXERT in an expert selection problem, where each expert corresponds to a neural network with a reject option. For this, we propose a method to train these neural networks and use them to evaluate the performance of EXERT on real-world datasets.
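As a rough illustration of the risk notion described in the abstract above, the following Python sketch flags an arm as risky when its mean-variance exceeds a threshold and counts the rounds in which a risky arm was selected. The formula "variance minus rho times mean" is one common form of mean-variance in the risk-averse bandit literature and is an assumption here, as are the names mean_variance, risk_violations and rho; the thesis may define these quantities differently. The sketch uses sample statistics from given reward lists rather than the true arm distributions, and it is not the EXERT algorithm itself.

```python
from statistics import fmean, pvariance


def mean_variance(rewards, rho=1.0):
    """Sample mean-variance of an arm: variance - rho * mean.

    This specific form and the trade-off weight `rho` are assumptions
    made for illustration; the abstract does not state the formula.
    """
    return pvariance(rewards) - rho * fmean(rewards)


def risk_violations(pulls, rewards_by_arm, threshold, rho=1.0):
    """Count the rounds in which a risky arm was selected.

    An arm is treated as risky when its sample mean-variance is
    strictly higher than `threshold`.
    """
    risky = {
        arm
        for arm, rewards in rewards_by_arm.items()
        if mean_variance(rewards, rho) > threshold
    }
    return sum(1 for arm in pulls if arm in risky)


# Hypothetical data: arm "a" is steady, arm "b" has the same mean but varies.
rewards_by_arm = {"a": [1.0, 1.0, 1.0], "b": [0.0, 2.0, 0.0, 2.0]}
pulls = ["a", "b", "b", "a", "b"]
violations = risk_violations(pulls, rewards_by_arm, threshold=-0.5)
```

With this hypothetical data, arm "a" has mean-variance -1.0 and arm "b" has 0.0, so only "b" exceeds the threshold of -0.5 and the three rounds that selected "b" are counted.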