Browsing by Subject "Reinforcement learning"
Now showing 1 - 20 of 22
Item Open Access
A 3D game theoretical framework for the evaluation of unmanned aircraft systems airspace integration concepts (Elsevier, 2021-10-23). Albaba, Berat Mert; Musavi, Negin; Yıldız, Yıldıray.
Predicting the outcomes of integrating Unmanned Aerial Systems (UAS) into the National Airspace System (NAS) is a complex problem that needs to be addressed through simulation studies before routine access of UAS into the NAS can be allowed. This paper provides a 3-dimensional (3D) simulation framework that uses a game-theoretical methodology to evaluate integration concepts in scenarios where manned and unmanned air vehicles co-exist. In the proposed method, the human pilot's interactive decision-making process is incorporated into the airspace models, filling a gap in the literature where pilot behavior is generally assumed to be known a priori. Human pilot behavior is modeled using a dynamic level-k reasoning concept and approximate reinforcement learning. Level-k reasoning is a notion from game theory based on the assumption that humans have various levels of decision making. In the conventional "static" approach, each agent makes assumptions about his or her opponents and chooses actions accordingly. In dynamic level-k reasoning, by contrast, agents can update their beliefs about their opponents and revise their level-k rule. In this study, Neural Fitted Q Iteration, an approximate reinforcement learning method, is used to model time-extended pilot decisions with 3D maneuvers. An analysis of UAS integration is conducted using an example 3D scenario in the presence of manned aircraft and fully autonomous UAS equipped with sense-and-avoid algorithms.

Item Open Access
Autonomous air combat with reinforcement learning under different noise conditions (IEEE - Institute of Electrical and Electronics Engineers, 2023-08-28). Taşbaş, A. S.; Serbest, S.; Şahin, Safa Onur; Üre, N. K.
The autonomous realization of air combat with reinforcement-learning-based methods has recently become a prominent field of study. In this paper, we present a classifier architecture to solve the air combat problem in noisy environments, a sub-branch of this field. We collect data from environments with different noise levels using an air combat simulation and use these data to construct three datasets with state stacks of 2, 4, and 8. We train neural-network-based classifiers on these datasets. The classifiers adaptively estimate the noise level in the environment at each time step and activate the appropriate pre-trained reinforcement learning policy based on this estimate. We also present a performance comparison of these classifiers across the different state-stack sizes.
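The classifier-gated policy switching described in the entry above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: a variance-threshold rule stands in for their neural-network classifiers, and all names (NoiseLevelClassifier, run_episode, the toy policy gains) are invented for the example.

```python
import numpy as np
from collections import deque

class NoiseLevelClassifier:
    """Stand-in for the paper's neural classifiers: a simple
    variance threshold on the state stack plays the same role."""
    def __init__(self, thresholds=(0.05, 0.2)):
        self.thresholds = thresholds

    def predict(self, state_stack):
        # Crude noise estimate: frame-to-frame variability within the stack.
        jitter = float(np.mean(np.std(np.diff(state_stack, axis=0), axis=0)))
        if jitter < self.thresholds[0]:
            return "low"
        if jitter < self.thresholds[1]:
            return "medium"
        return "high"

def run_episode(env_step, policies, classifier, init_state, stack=4, horizon=100):
    """At each step, estimate the noise level from the recent state stack
    and dispatch to the pre-trained policy for that level."""
    states = deque([np.asarray(init_state)] * stack, maxlen=stack)
    state = np.asarray(init_state)
    for _ in range(horizon):
        level = classifier.predict(np.stack(list(states)))
        action = policies[level](state)  # pre-trained RL policy per noise level
        state = np.asarray(env_step(state, action))
        states.append(state)
    return state

# Toy usage with placeholder linear policies and noisy dynamics:
policies = {lvl: (lambda s, g=g: -g * s)
            for lvl, g in [("low", 0.1), ("medium", 0.5), ("high", 0.9)]}
final = run_episode(lambda s, a: s + a + np.random.normal(0, 0.1, s.shape),
                    policies, NoiseLevelClassifier(), np.ones(3))
```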
Item Open Access
Big-data streaming applications scheduling based on staged multi-armed bandits (Institute of Electrical and Electronics Engineers, 2016). Kanoun, K.; Tekin, C.; Atienza, D.; Van Der Schaar, M.
Several techniques have recently been proposed to adapt Big-Data streaming applications to existing many-core platforms. Among these techniques, online reinforcement learning methods learn how to adapt, at run-time, the throughput and resources allocated to the various streaming tasks depending on dynamically changing data stream characteristics and the desired application performance (e.g., accuracy). However, most state-of-the-art techniques consider only a single input stream in their application model and assume that the system knows how many resources to allocate to each task to achieve the desired performance. To address these limitations, in this paper we propose a new systematic and efficient methodology, with associated algorithms, for online learning and energy-efficient scheduling of Big-Data streaming applications with multiple streams on many-core systems with resource constraints. We formalize the problem of multi-stream scheduling as a staged decision problem in which the performance obtained for various resource allocations is unknown. The proposed scheduling methodology uses a novel class of online adaptive learning techniques which we refer to as staged multi-armed bandits (S-MAB). Our scheduler learns online which processing method to assign to each stream and how to allocate its resources over time in order to maximize performance on the fly, at run-time, without access to any offline information. Applied to a face-detection streaming application, and without using any offline information, the proposed scheduler achieves performance similar to an optimal semi-online solution that has full knowledge of the input stream, with differences in throughput, observed quality, resource usage, and energy efficiency of less than 1, 0.3, 0.2, and 4 percent, respectively.

Item Open Access
Contact energy based hindsight experience prioritization (IEEE, 2024-08-08). Sayar, Erdi; Bing, Zhenshan; D'Eramo, Carlo; Öğüz, Salih Özgür; Knoll, Alois.
Multi-goal robot manipulation tasks with sparse rewards are difficult for reinforcement learning (RL) algorithms due to the inefficiency of collecting successful experiences. Recent algorithms such as Hindsight Experience Replay (HER) expedite learning by taking advantage of failed trajectories, replacing the desired goal with one of the achieved states so that any failed trajectory can contribute to learning. However, HER chooses failed trajectories uniformly, without regard to which ones might be the most valuable for learning. In this paper, we address this problem and propose a novel approach, Contact Energy Based Prioritization (CEBP), which selects samples from the replay buffer based on the rich information carried by contact, leveraging the touch sensors in the robot's gripper and object displacement. Our prioritization scheme favors sampling contact-rich experiences, which are arguably the ones providing the largest amount of information. We evaluate the proposed approach on various sparse-reward robotic tasks and compare it with state-of-the-art methods, showing that it surpasses or performs on par with those methods on robot manipulation tasks. Finally, we deploy the trained policy to a real Franka robot for a pick-and-place task and observe that the robot solves the task successfully. The videos and code are publicly available at: https://erdiphd.github.io/HER force/.
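A minimal sketch of contact-based prioritization as described in the CEBP entry above. The contact_energy definition is an assumption made for illustration (the paper derives its measure from the gripper's touch sensors and object displacement), and the class and function names are invented here.

```python
import numpy as np

def contact_energy(forces, displacements):
    # One plausible reading of "contact energy": accumulated tactile force
    # magnitude weighted by object displacement along the trajectory
    # (an assumption for illustration, not the authors' exact definition).
    return float(np.sum(np.abs(forces) * np.abs(displacements)))

class ContactPrioritizedReplay:
    """Sketch of CEBP-style sampling: trajectories rich in contact get a
    proportionally higher chance of being replayed."""
    def __init__(self, rng=None):
        self.trajectories, self.energies = [], []
        self.rng = rng or np.random.default_rng()

    def add(self, trajectory, forces, displacements):
        self.trajectories.append(trajectory)
        self.energies.append(contact_energy(forces, displacements))

    def sample(self, batch_size):
        p = np.asarray(self.energies) + 1e-6  # floor so nothing is starved
        p = p / p.sum()
        idx = self.rng.choice(len(self.trajectories), size=batch_size, p=p)
        return [self.trajectories[i] for i in idx]
```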
Item Open Access
Do players learn how to learn? : evidence from constant sum games with varying number of actions (2009). Saraçgil, İhsan Erman.
This thesis investigates the learning behaviour of individuals in strategic environments with different complexity levels. A new experiment is conducted in which ascending or descending series of constant sum games are played by subjects, and the experimental data, including both stated beliefs and actual plays, are used to estimate which learning model best explains the subjects' behaviour within and across these games. Taking into consideration learning rules that model the opponent as a learning agent, as well as the heterogeneity of the population, the estimation results support the view that people switch learning rules across games and use different models in different games. This game-dependency is confirmed by the action, belief, and joint estimations. Although their likelihoods vary from game to game, best response to uniform beliefs and reinforcement learning are the most commonly used learning rules in the four games considered in the experiment, while fictitious play and iterations on it are rare instances observed only in estimation by stated beliefs. Despite the change across games, there is no significant link between the complexity of the game and the cognitive hierarchy of learning models. Belief statements and best-response behaviour also differ across games: people make smoother guesses in large-action games and more dispersed belief statements in small-action games. Inconsistency between actions and stated beliefs is stronger in large-action games. The evidence strongly supports that learning and belief formation are both game-dependent.

Item Open Access
Driver modeling using a continuous policy space: theory and traffic data validation (Institute of Electrical and Electronics Engineers, 2023-11-16). Yaldiz, C. O.; Yıldız, Yıldıray.
In this article, we present a continuous-policy-space game-theoretical method for modeling human driver interactions in highway traffic. The proposed method is based on Gaussian Processes and is developed as a refinement of the hierarchical decision-making concept called "level-k reasoning," which conventionally assigns discrete levels of behavior to agents. The conventional level-k reasoning approach may impose undesired constraints on predicting human decision making because of the limited number (usually 2 or 3) of driver policies it provides. To fill this gap in the literature, we expand the framework to a continuous domain, enabling a continuous policy space consisting of infinitely many driver policies. Through the approach detailed in this article, more accurate and realistic driver models can be obtained and employed to create high-fidelity simulation platforms for validating autonomous vehicle control algorithms. We validate the proposed method on a traffic dataset and compare it with the conventional level-k approach to demonstrate its contributions and implications.
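The continuous-policy-space idea in the driver-modeling entry above can be illustrated with an off-the-shelf Gaussian Process: fit over (state, level) pairs so that the discrete level-k axis becomes continuous and fractional levels such as k = 1.4 can be queried. The data, features, and kernel below are placeholders, not the article's actual setup.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

# Placeholder training data: actions produced by discrete level-0/1/2
# policies in observed traffic states (gap, relative speed, lane offset).
states = rng.uniform(size=(60, 3))
levels = rng.integers(0, 3, size=60).astype(float)
actions = np.sin(states.sum(axis=1)) + 0.3 * levels  # synthetic driver actions

# Fit a GP over (state, level) so the reasoning level becomes a continuous axis.
X = np.column_stack([states, levels])
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-2)
gp.fit(X, actions)

# Query a driver at a *fractional* reasoning level, e.g. k = 1.4:
query = np.column_stack([states[:5], np.full(5, 1.4)])
mean_action, std_action = gp.predict(query, return_std=True)
```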
Item Open Access
Dynamic capacity management for voice over packet networks (2003-06-07). Akar, Nail; Şahin, Cem.
In this paper, dynamic capacity management refers to the process of dynamically changing the capacity allocation (reservation) of a pseudo-wire established between two network end points. This process is based on criteria including the instantaneous traffic load of the pseudo-wire, network utilization, time of day, or day of week. Frequent adjustment of the capacity creates a scalability issue in the form of a significant amount of message processing in the network elements involved in the capacity update process. On the other hand, if the capacity is adjusted once, for the worst possible traffic conditions, a significant amount of bandwidth may be wasted depending on the actual traffic load. There is therefore a need for dynamic capacity management that takes into account the tradeoff between scalability and bandwidth efficiency. The problem is motivated by voice over packet networks, in which end-to-end reservation requests are initiated by PSTN voice calls and these reservations are aggregated into one single reservation in the core packet network for scalability. In this paper, we introduce a Markov decision framework for an optimal reservation aggregation scheme for voice over packet networks. Moreover, for problems of large size, we provide a suboptimal scheme using reinforcement learning. We show a significant improvement in bandwidth efficiency in voice over packet networks using aggregate reservations. © 2003 IEEE.

Item Open Access
Facial feedback for reinforcement learning: A case study and offline analysis using the TAMER framework (Springer, 2020-02). Li, G.; Dibeklioğlu, Hamdi; Whiteson, S.; Hung, H.
Interactive reinforcement learning provides a way for agents to learn to solve tasks from evaluative feedback provided by a human user. Previous research showed that humans give copious feedback early in training but very sparsely thereafter. In this article, we investigate the potential of agents learning from trainers' facial expressions by interpreting them as evaluative feedback. To do so, we implemented TAMER, a popular interactive reinforcement learning method, in a reinforcement-learning benchmark problem (Infinite Mario) and conducted the first large-scale study of TAMER, involving 561 participants. Using a purpose-designed CNN-RNN model, our analysis shows that instructing trainers to use facial expressions, together with competition, can improve the accuracy of estimating positive and negative feedback from facial expressions. In addition, our results from a simulation experiment show that learning solely from predicted feedback based on facial expressions is possible, and that with strong, effective prediction models or a regression method, facial responses would significantly improve the performance of agents. Furthermore, our experiment supports previous studies demonstrating the importance of bi-directional feedback and competitive elements in the training interface.
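A minimal sketch of the TAMER-style learning loop referenced in the entry above: the agent regresses the human reinforcement signal H(s, a) and acts greedily on the learned model. The linear form, features, and learning rate are simplifications assumed for illustration; the study itself predicts the feedback signal from facial expressions with a CNN-RNN model.

```python
import numpy as np

class LinearTamer:
    """Minimal TAMER-style learner (sketch): regress the human
    reinforcement signal H(s, a) with a linear model per action and
    act greedily on it."""
    def __init__(self, n_features, n_actions, lr=0.1):
        self.w = np.zeros((n_actions, n_features))
        self.lr = lr

    def act(self, phi):
        # Greedy with respect to the predicted human feedback.
        return int(np.argmax(self.w @ phi))

    def update(self, phi, action, feedback):
        # feedback: scalar credit from the trainer; in the study above it
        # could instead be decoded from facial expressions by a classifier.
        error = feedback - self.w[action] @ phi
        self.w[action] += self.lr * error * phi

# Toy usage: 4 features, 3 actions, one positive feedback signal.
agent = LinearTamer(n_features=4, n_actions=3)
phi = np.array([1.0, 0.5, 0.0, 0.2])
a = agent.act(phi)
agent.update(phi, a, feedback=+1.0)
```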
Item Open Access
Functional contour-following via haptic perception and reinforcement learning (Institute of Electrical and Electronics Engineers, 2018). Hellman, R. B.; Tekin, Cem; Schaar, M. V.; Santos, V. J.
Many tasks involve the fine manipulation of objects despite limited visual feedback. In such scenarios, tactile and proprioceptive feedback can be leveraged for task completion. We present an approach for real-time haptic perception and decision-making for a haptics-driven, functional contour-following task: the closure of a ziplock bag. This task is challenging for robots because the bag is deformable, transparent, and visually occluded by artificial fingertip sensors that are also compliant. A deep neural net classifier was trained to estimate the state of a zipper within a robot's pinch grasp. A Contextual Multi-Armed Bandit (C-MAB) reinforcement learning algorithm was implemented to maximize cumulative rewards by balancing exploration and exploitation of the state-action space. The C-MAB learner outperformed a benchmark Q-learner by exploring the state-action space more efficiently while learning a hard-to-code task. The learned C-MAB policy was tested with novel ziplock bag scenarios and contours (wire, rope). Importantly, this work contributes to the development of reinforcement learning approaches that account for limited resources such as hardware life and researcher time. As robots are used to perform complex, physically interactive tasks in unstructured or unmodeled environments, it becomes important to develop methods that enable efficient and effective learning with physical testbeds.

Item Open Access
A game theoretical framework for the evaluation of unmanned aircraft systems airspace integration concepts (2017-07). Musavi, Neginsadat.
Predicting the outcomes of integrating Unmanned Aerial Systems (UAS) into the National Airspace System (NAS) is a complex problem that needs to be addressed through simulation studies before routine access of UAS into the NAS can be allowed. This thesis provides 2D and 3D simulation frameworks that use a game-theoretical methodology to evaluate integration concepts in scenarios where manned and unmanned air vehicles co-exist. The fundamental gap in the literature is that the models of interaction between manned and unmanned vehicles are insufficient: a) they assume that pilot behavior is known a priori, and b) they disregard decision-making processes. The contribution of this work is a modeling framework in which human pilot reactions are modeled using reinforcement learning and a game-theoretical concept called level-k reasoning to fill this gap. The level-k reasoning concept is based on the assumption that humans have various levels of decision making. Reinforcement learning is a mathematical learning method rooted in human learning. In this work, a classical and an approximate reinforcement learning (Neural Fitted Q Iteration) method are used to model time-extended pilot decisions with 2D and 3D maneuvers. An analysis of UAS integration is conducted using example scenarios in the presence of manned aircraft and fully autonomous UAS equipped with sense-and-avoid algorithms.
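Both UAS-integration entries above model time-extended pilot decisions with Neural Fitted Q Iteration. A batch fitted-Q loop, with an off-the-shelf regressor standing in for the neural Q-function, might look as follows; the hyperparameters, function names, and toy data are assumptions for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def fitted_q_iteration(transitions, n_actions, gamma=0.99, n_iters=10):
    """Batch NFQ sketch: repeatedly regress Bellman targets on a fixed set
    of (state, action, reward, next_state) transitions."""
    s = np.array([t[0] for t in transitions], dtype=float)
    a = np.array([t[1] for t in transitions], dtype=float)
    r = np.array([t[2] for t in transitions], dtype=float)
    s2 = np.array([t[3] for t in transitions], dtype=float)
    X = np.column_stack([s, a])
    q = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=500)
    q.fit(X, r)  # first pass: myopic (reward-only) targets
    for _ in range(n_iters):
        # Bellman targets: r + gamma * max_a' Q(s', a')
        q_next = np.column_stack([
            q.predict(np.column_stack([s2, np.full(len(s2), act, dtype=float)]))
            for act in range(n_actions)
        ])
        q.fit(X, r + gamma * q_next.max(axis=1))
    return q

# Toy usage on random 2-D states and 3 discrete actions:
rng = np.random.default_rng(1)
data = [(rng.uniform(size=2), rng.integers(3), rng.normal(), rng.uniform(size=2))
        for _ in range(200)]
q_func = fitted_q_iteration(data, n_actions=3)
```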
Item Open Access
A game theoretical modeling and simulation framework for the integration of unmanned aircraft systems into the national airspace (AIAA, 2016). Musavi, Negin; Tekelioğlu, K. B.; Yıldız, Yıldıray; Güneş, Kerem; Onural, Deniz.
The focus of this paper is to present a game theoretical modeling and simulation framework for the integration of Unmanned Aircraft Systems (UAS) into the National Airspace System (NAS). The research problem of this work is predicting the outcome of complex scenarios where UAS and manned air vehicles co-exist. The fundamental gap in the literature on developing models for UAS integration into the NAS is that the models of interaction between manned and unmanned vehicles are insufficient: a) they assume that human behavior is known a priori, and b) they disregard the human reaction and decision-making process. The contribution of this paper is a realistic modeling and simulation framework that fills this gap. The foundations of the proposed modeling method are formed by game theory, which analyzes strategic decision making between intelligent agents; the bounded rationality concept, which reflects the fact that humans cannot always make perfect decisions; and reinforcement learning, which has been shown to be effective for modeling human behavior in the psychology literature. These concepts are used to develop a simulator that can produce the outcomes of scenarios consisting of UAS, manned vehicles, automation, and their interactions. An analysis of UAS integration is performed with a scenario specifically designed for this paper, in which a UAS equipped with a sense-and-avoid algorithm moves along a predefined trajectory in a crowded airspace. The effect of various system parameters on the safety and performance of the overall system is then investigated.

Item Open Access
Jamming bandits – a novel learning method for optimal jamming (Institute of Electrical and Electronics Engineers Inc., 2016). Amuru, S.; Tekin, C.; Van Der Schaar, M.; Buehrer, R.M.
Can an intelligent jammer learn and adapt to unknown environments in an electronic warfare-type scenario? In this paper, we answer this question in the positive by developing a cognitive jammer that adaptively and optimally disrupts the communication between a victim transmitter-receiver pair. We formalize the problem using a multi-armed bandit framework in which the jammer can choose various physical-layer parameters, such as the signaling scheme, power level, and on-off/pulsing duration, in an attempt to obtain power-efficient jamming strategies. We first present online learning algorithms that maximize jamming efficacy against static transmitter-receiver pairs and prove that these algorithms converge to the jamming strategy that is optimal in terms of the error rate inflicted at the victim and the energy used. Even more importantly, we prove that the rate of convergence to the optimal jamming strategy is sublinear, i.e., the learning is fast in comparison to existing reinforcement learning algorithms, which is particularly important in dynamically changing wireless environments. We also characterize the performance of the proposed bandit-based learning algorithm against multiple static and adaptive transmitter-receiver pairs.

Item Open Access
Modeling cyber-physical human systems via an interplay between reinforcement learning and game theory (Elsevier, 2019). Albaba, Berat Mert; Yıldız, Yıldıray.
Predicting the outcomes of cyber-physical systems with multiple human interactions is a challenging problem. This article reviews a game theoretical approach to address this issue, where reinforcement learning is employed to predict the time-extended interaction dynamics. The most attractive feature of the method is that it offers a computationally feasible way to model multiple humans simultaneously as decision makers, instead of determining the decision dynamics of only the intelligent agent of interest and forcing the others to obey kinematic and dynamic constraints imposed by the environment. We present two recent applications of the method: modeling (1) unmanned aircraft integration into the National Airspace System and (2) highway traffic. We conclude the article with ongoing and future work on employing, improving, and validating the method, along with related open problems and research opportunities.
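The jamming-bandits entry above casts jammer adaptation as a multi-armed bandit over physical-layer parameters. A generic UCB1 loop over candidate jamming configurations is sketched below; the arm set and the reward function are placeholders, not the paper's algorithm, which also handles adaptive victims.

```python
import math
import random

# Candidate arms: (signaling scheme, power level, pulse duty cycle) — illustrative.
arms = [("bpsk", p, d) for p in (0.5, 1.0) for d in (0.25, 1.0)]

counts = [0] * len(arms)
values = [0.0] * len(arms)

def jamming_reward(arm):
    # Placeholder for the measured jamming efficacy (e.g., induced error
    # rate per unit energy); in practice this comes from the environment.
    return random.random() * arm[1] * arm[2]

for t in range(1, 1001):
    if 0 in counts:
        i = counts.index(0)  # play each arm once first
    else:
        # UCB1 index: empirical mean plus exploration bonus.
        i = max(range(len(arms)),
                key=lambda k: values[k] + math.sqrt(2 * math.log(t) / counts[k]))
    r = jamming_reward(arms[i])
    counts[i] += 1
    values[i] += (r - values[i]) / counts[i]  # incremental mean update
```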
Item Open Access
Novel sampling strategies for experience replay mechanisms in off-policy deep reinforcement learning algorithms (2024-09). Mutlu, Furkan Burak.
Experience replay enables agents to reuse their past experiences repeatedly to improve learning performance. Traditional strategies, such as vanilla experience replay, sample uniformly from the replay buffer, which can be inefficient because they do not account for the varying importance of different transitions. More advanced methods, such as Prioritized Experience Replay (PER), address this by adjusting the sampling probability of each transition according to its perceived importance. However, constantly recalculating these probabilities for every transition in the buffer after each iteration is computationally expensive and impractical for large-scale applications. Moreover, these methods do not necessarily enhance the performance of actor-critic-based reinforcement learning algorithms, since they typically rely on predefined metrics, such as the Temporal Difference (TD) error, that do not directly represent the relevance of a transition to the agent's policy. The importance of a transition can change dynamically throughout training, but existing approaches struggle to adapt to this due to computational constraints. Both vanilla sampling strategies and advanced methods like PER introduce biases toward certain transitions: vanilla experience replay tends to favor older transitions, which may no longer be useful since they were often generated by a random policy during initialization, while PER is biased toward transitions with high TD error, which primarily reflects errors in the critic network and may not correspond to improvements in the policy network, as there is no direct correlation between TD error and policy enhancement. Given these challenges, we propose a new sampling strategy designed to mitigate bias and ensure that every transition is used in updates an equal number of times. Our method, Corrected Uniform Experience Replay (CUER), leverages an efficient sum-tree structure to achieve fair sampling counts for all transitions. We evaluate CUER on various continuous control tasks and demonstrate that it outperforms both traditional and advanced replay mechanisms when applied to state-of-the-art off-policy deep reinforcement learning algorithms such as TD3 and SAC. Empirical results indicate that CUER consistently improves sample efficiency without imposing a significant computational burden, leading to faster convergence and more stable learning performance.
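The equal-use idea behind CUER in the entry above can be sketched without the sum tree: track how many times each stored transition has been replayed and always draw the least-used ones. The thesis uses a sum-tree structure to make this efficient at scale; the naive scan below, and its class name, are for illustration only.

```python
import numpy as np

class EqualUseReplayBuffer:
    """Sketch of CUER's goal: every stored transition participates in
    (approximately) the same number of updates. A real implementation
    would use a sum tree for O(log n) sampling; this version scans the
    usage counts directly."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data, self.use_counts = [], []

    def add(self, transition):
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.use_counts.pop(0)
        self.data.append(transition)
        self.use_counts.append(0)

    def sample(self, batch_size):
        counts = np.asarray(self.use_counts)
        # Least-used-first: corrects the bias toward heavily replayed
        # (typically old) transitions.
        idx = np.argsort(counts)[:batch_size]
        for i in idx:
            self.use_counts[i] += 1
        return [self.data[i] for i in idx]
```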
Item Open Access
The performance comparison of different training strategies for reinforcement learning on DeepRTS (IEEE, 2022-08-29). Şahin, Safa Onur; Yücesoy, Veysel.
In this paper, we train reinforcement learning agents on the game of DeepRTS under different training strategies: i) training against rule-based agents, ii) self-training, and iii) training by adversarial attack on another agent. We make certain modifications to the DeepRTS game and the reinforcement learning framework to bring it closer to real-life decision-making problems. To this end, we allow agents to take macro actions based on human heuristics, where these actions may last multiple time steps and their durations may differ from one another. In addition, the agents simultaneously take actions for every available unit at each time step. We train the reinforcement-learning-based agents under the three training strategies and provide a detailed performance analysis of these agents against several reference agents.

Item Open Access
Predicting human behavior using static and dynamic models (2021-08). Albaba, Berat Mert.
Modeling human behavior is a challenging problem, and it is necessary for the safe integration of autonomous systems into daily life. This thesis focuses on modeling human behavior through static and dynamic models. The first contribution is a stochastic modeling framework that synergistically combines a static iterated reasoning approach with deep reinforcement learning. Using statistical goodness-of-fit tests, the proposed approach is shown to accurately predict human driver behavior in highway scenarios. Although human driver behavior is modeled successfully with the static model, the scope of interactions that can be modeled with this approach is limited to short-duration interactions. For interactions that are long enough to induce adaptive behavior, models that incorporate learning are needed. The second contribution is a learning model for time-extended human-human interactions. Through a hierarchical reasoning solution approach, equilibrium concepts are combined with Gaussian Processes to predict learning behavior. As a result, a novel bounded rational learning model is proposed.

Item Open Access
Predicting pilot behavior in medium-scale scenarios using game theory and reinforcement learning (American Institute of Aeronautics and Astronautics Inc., 2014). Yildiz, Y.; Agogino, A.; Brat, G.
A key element in meeting the continuing growth in air traffic is the increased use of automation. Decision support systems, computer-based information acquisition, trajectory planning systems, high-level graphic display systems, and advisory systems are all considered automation components of the next generation (NextGen) airspace. Given a set of goals represented as reward functions, the actions of the players may be predicted. However, several challenges must be overcome. First, determining how a player can attempt to maximize their reward function can be a difficult inverse problem. Second, players may not be able to maximize their reward functions perfectly. ADS-B technology can provide pilots with information about other aircraft, such as position and velocity, but a pilot has a limited ability to use all of this information in his or her decision making. For this scenario, the authors model these pilot limitations by assuming that pilots can observe only a limited section of the grid in front of them.
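The bounded-observation assumption in the pilot-modeling entry above (pilots see only a limited section of the grid ahead) amounts to extracting a small window from the airspace grid. The sketch below assumes a 2D occupancy grid and four cardinal headings; the window sizes and function name are illustrative, not the paper's parameters.

```python
import numpy as np

def limited_observation(grid, pos, heading, depth=3, width=3):
    """Return the small patch of cells a pilot can observe ahead of the
    aircraft. grid: 2D occupancy array; pos: (row, col); heading: one of
    'N', 'S', 'E', 'W' (north = decreasing row index)."""
    r, c = pos
    half = width // 2
    if heading == "N":
        window = grid[max(r - depth, 0):r, max(c - half, 0):c + half + 1]
    elif heading == "S":
        window = grid[r + 1:r + 1 + depth, max(c - half, 0):c + half + 1]
    elif heading == "E":
        window = grid[max(r - half, 0):r + half + 1, c + 1:c + 1 + depth]
    else:  # "W"
        window = grid[max(r - half, 0):r + half + 1, max(c - depth, 0):c]
    return window.copy()

# Toy usage: a 10x10 airspace with one intruder north of the ownship.
airspace = np.zeros((10, 10))
airspace[3, 5] = 1.0
visible = limited_observation(airspace, pos=(6, 5), heading="N")
```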
Item Open Access
Q-Learning for MDPs with general spaces: convergence and near optimality via quantization under weak continuity (Journal of Machine Learning Research, 2023-07-12). Kara, A. D.; Saldı, Naci; Yüksel, S.
Reinforcement learning algorithms often require finiteness of state and action spaces in Markov decision processes (MDPs, also called controlled Markov chains), and various efforts have been made in the literature toward the applicability of such algorithms to continuous state and action spaces. In this paper, we show that under very mild regularity conditions (in particular, involving only weak continuity of the transition kernel of an MDP), Q-learning for standard Borel MDPs via quantization of states and actions (called Quantized Q-Learning) converges to a limit, and furthermore this limit satisfies an optimality equation that leads to near optimality, either with explicit performance bounds or with guarantees of asymptotic optimality. Our approach builds on (i) viewing quantization as a measurement kernel, and thus a quantized MDP as a partially observed Markov decision process (POMDP), (ii) utilizing near-optimality and convergence results of Q-learning for POMDPs, and (iii) the near-optimality of finite state model approximations for MDPs with weakly continuous kernels, which we show to correspond to the fixed point of the constructed POMDP. Thus, our paper presents a very general convergence and approximation result for the applicability of Q-learning to continuous MDPs.

Item Open Access
Reinforcement-learning-based job-shop scheduling for intelligent intersection management (IEEE, 2023-06-02). Huang, Shao-Ching; Lin, Kai-En; Kuo, Cheng-Yen; Lin, Li-Heng; Sayın, Muhammed Ömer; Lin, Chung-Wei.
The goal of intersection management is to organize vehicles so that they pass through the intersection safely and efficiently. With the technical advances in connected and autonomous vehicles, intersection management is becoming more intelligent and potentially unsignalized. In this paper, we propose a reinforcement-learning-based methodology to train a centralized intersection manager. We define the intersection scheduling problem with a graph-based model and transform it into the job-shop scheduling problem (JSSP) with additional constraints. To apply reinforcement learning, we model the scheduling procedure as a Markov decision process (MDP) and train the agent with proximal policy optimization (PPO). A grouping strategy is also developed to apply the trained model to streams of vehicles. Experimental results show that the learning-based intersection manager is especially effective at high traffic densities. This paper is the first work in the literature to apply reinforcement learning to the graph-based intersection model. The proposed methodology can flexibly deal with any conflicting scenario, indicating the applicability of reinforcement learning to intelligent intersection management.

Item Open Access
Strategizing against q-learners: a control-theoretical approach (Institute of Electrical and Electronics Engineers, 2024-06-18). Arslantaş, Yüksel; Yüceel, Ege; Sayın, Muhammed O.
In this letter, we explore the susceptibility of independent Q-learning algorithms (a classical and widely used multi-agent reinforcement learning method) to strategic manipulation by sophisticated opponents in repeatedly played normal-form games. We quantify how much strategically sophisticated agents can exploit naive Q-learners if they know the opponents' Q-learning algorithm. To this end, we formulate the strategic actors' interactions as a stochastic game (whose state encompasses the Q-function estimates of the Q-learners), as if the Q-learning algorithms were the underlying dynamical system. We also present a quantization-based approximation scheme to tackle the continuum state space and analyze its performance, both analytically and numerically, for two competing strategic actors and for a single strategic actor.
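The premise of the final entry, treating a known Q-learning rule as the underlying dynamical system, can be illustrated with a short rollout: the strategic actor simulates the learner's softmax response, plays a best reply, and advances the learner's Q-estimate with its own update rule. This myopic loop is only a sketch under assumed parameters; the letter itself solves the full stochastic game with a quantization-based approximation.

```python
import numpy as np

def simulate_exploitation(q0, payoff, horizon=100, alpha=0.1, temp=0.1, seed=0):
    """Greedy sketch of exploiting a known Q-learner in a repeated
    normal-form game. payoff has shape (nA, nB, 2): entry [a, b] holds the
    (strategic actor, Q-learner) payoffs. Softmax temperature, learning
    rate, and the one-step-greedy choice are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    q = np.asarray(q0, dtype=float).copy()
    total = 0.0
    for _ in range(horizon):
        probs = np.exp(q / temp)
        probs /= probs.sum()                  # learner's softmax mixed strategy
        utilities = payoff[:, :, 0] @ probs   # our expected payoff per action
        a = int(np.argmax(utilities))         # myopic best reply
        b = int(rng.choice(len(q), p=probs))  # learner samples its action
        total += payoff[a, b, 0]
        # Roll the learner's Q forward with its own (known) update rule:
        q[b] += alpha * (payoff[a, b, 1] - q[b])
    return total

# Toy usage on matching-pennies-like payoffs (2 actions per player):
payoff = np.stack([np.array([[1.0, -1.0], [-1.0, 1.0]]),
                   np.array([[-1.0, 1.0], [1.0, -1.0]])], axis=-1)
print(simulate_exploitation(np.zeros(2), payoff))
```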