Browsing by Subject "Deep reinforcement learning"
Now showing 1 - 12 of 12
Item Open Access
Actor prioritized experience replay (AI Access Foundation, 2023-11-16) Sağlam, B.; Mutlu, Furkan Burak; Cicek, Dogan C.; Kozat, Süleyman S.
A widely-studied deep reinforcement learning (RL) technique known as Prioritized Experience Replay (PER) allows agents to learn from transitions sampled with non-uniform probability proportional to their temporal-difference (TD) error. Although PER has been shown to be one of the most crucial components for the overall performance of deep RL methods in discrete action domains, many empirical studies indicate that it considerably underperforms when used with off-policy actor-critic algorithms. We theoretically show that actor networks cannot be effectively trained with transitions that have large TD errors: the approximate policy gradient computed under the Q-network diverges from the actual gradient computed under the optimal Q-function. Motivated by this, we introduce a novel experience replay sampling framework for actor-critic methods, which also addresses stability issues and recent findings on the poor empirical performance of PER. The introduced algorithm suggests a new branch of improvements to PER and schedules effective and efficient training for both the actor and critic networks. An extensive set of experiments verifies our theoretical findings, showing that our method outperforms competing approaches and achieves state-of-the-art results over the standard off-policy actor-critic algorithms.
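
The proportional prioritization that PER builds on can be sketched in a few lines. The snippet below is a minimal, generic PER buffer in Python with hypothetical class and parameter names; it is not the actor-critic sampling framework proposed in the entry above.

```python
import numpy as np

class ProportionalReplay:
    """Minimal proportional PER buffer (illustrative; names are hypothetical)."""

    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.buffer, self.priorities, self.pos = [], np.zeros(capacity), 0

    def add(self, transition, td_error):
        # New transitions enter with priority proportional to |TD error|.
        priority = (abs(td_error) + self.eps) ** self.alpha
        if len(self.buffer) < self.capacity:
            self.buffer.append(transition)
        else:
            self.buffer[self.pos] = transition
        self.priorities[self.pos] = priority
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        p = self.priorities[:len(self.buffer)]
        probs = p / p.sum()
        idx = np.random.choice(len(self.buffer), batch_size, p=probs)
        # Importance-sampling weights correct the bias of non-uniform sampling.
        weights = (len(self.buffer) * probs[idx]) ** (-beta)
        weights /= weights.max()
        return [self.buffer[i] for i in idx], idx, weights

    def update(self, idx, td_errors):
        # Re-prioritize sampled transitions with their fresh TD errors.
        self.priorities[idx] = (np.abs(td_errors) + self.eps) ** self.alpha
```

In the standard scheme, the critic is trained on these prioritized batches; the entry above argues that the actor is not trained effectively on large-TD-error transitions, which motivates its modified sampling framework.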

Item Open Access
Deep intrinsically motivated exploration in continuous control (Springer, 2023-10-26) Sağlam, Baturay; Kozat, Süleyman Serdar
In continuous control, exploration is often performed through undirected strategies in which the parameters of the networks or the selected actions are perturbed by random noise. Although the deep setting of undirected exploration has been shown to improve the performance of on-policy methods, it introduces excessive computational complexity and is known to fail in the off-policy setting. Intrinsically motivated exploration is an effective alternative to undirected strategies, but it has usually been studied in discrete action domains. In this paper, we investigate how intrinsic motivation can effectively be combined with deep reinforcement learning in the control of continuous systems to obtain a directed exploratory behavior. We adapt the existing theories on animal motivational systems into the reinforcement learning paradigm and introduce a novel and scalable directed exploration strategy. The introduced approach, motivated by the maximization of the value function's error, can benefit from a collected set of experiences by extracting useful information, and it unifies the intrinsic exploration motivations in the literature under a single exploration objective. An extensive set of empirical studies demonstrates that our framework extends to larger and more diverse state spaces, dramatically improves the baselines, and outperforms the undirected strategies significantly.

Item Open Access
Deep reinforcement learning based joint downlink beamforming and RIS configuration in RIS-aided MU-MISO systems under hardware impairments and imperfect CSI (IEEE, 2023-10-23) Sağlam, Baturay; Güngörlüoğlu, D.; Kozat, Süleyman Serdar
We introduce a novel deep reinforcement learning (DRL) approach to jointly optimize transmit beamforming and reconfigurable intelligent surface (RIS) phase shifts in a multi-user multiple-input single-output (MU-MISO) system to maximize the sum downlink rate under the phase-dependent reflection amplitude model. Our approach addresses the challenge of imperfect channel state information (CSI) and hardware impairments by considering a practical RIS amplitude model. We compare the performance of our approach against a vanilla DRL agent in two scenarios: perfect CSI with phase-dependent RIS amplitudes, and mismatched CSI with ideal RIS reflections. The results demonstrate that the proposed framework significantly outperforms the vanilla DRL agent under mismatch and approaches the gold standard. Our contributions include modifications to the DRL approach to address the joint design of transmit beamforming and phase shifts under the phase-dependent amplitude model. To the best of our knowledge, our method is the first DRL-based approach for the phase-dependent reflection amplitude model in RIS-aided MU-MISO systems. Our findings highlight the potential of our approach as a promising solution to overcome hardware impairments in RIS-aided wireless communication systems.

Item Open Access
Deep reinforcement learning for urban modeling: morphogenesis simulation of self-organized settlements (2023-07) H'sain, Houssame Eddine
Self-organized modes of urban growth could result in high-quality urban space and have notable benefits such as providing affordable housing and wider access to economic opportunities within cities. Modeling this non-linear, complex, and dynamic sequential urban aggregation process requires adaptive sequential decision-making. In this study, a deep reinforcement learning (DRL) approach is proposed to automatically learn adaptive decision policies that generate self-organized settlements maximizing a given performance objective. A framework to formulate the self-organized settlement morphogenesis problem as a single-agent reinforcement learning (RL) environment is presented. This framework is then verified by developing three environments based on two cellular automata urban growth models and training RL agents with the Deep Q-Network (DQN) and Proximal Policy Optimization (PPO) algorithms to learn sequential urban aggregation policies that maximize performance metrics within those environments. The agents consistently learn to sequentially grow the settlements while adapting their morphology to maximize performance, maintain right-of-way, and adapt to topographic constraints. The method proposed in this study can be used not only to model self-organized settlement growth based on preset performance objectives but could also be generalized to solve various single-agent sequential decision-making generative design problems.
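
To make the single-agent RL formulation above concrete, here is a minimal, hypothetical Gym-style environment in which an agent grows a settlement one grid cell at a time and is rewarded for compact, contiguous growth. It is a simplified illustration of the kind of setup the thesis describes, not the author's actual environments or performance metrics.

```python
import numpy as np

class SettlementGrowthEnv:
    """Hypothetical grid environment: the agent grows a settlement one cell at a time.
    The reward favors compact, contiguous growth (a stand-in for the thesis's metrics)."""

    def __init__(self, size=16, budget=40):
        self.size, self.budget = size, budget
        self.reset()

    def reset(self):
        self.grid = np.zeros((self.size, self.size), dtype=np.int8)
        self.grid[self.size // 2, self.size // 2] = 1   # seed cell
        self.steps = 0
        return self.grid.copy()

    def step(self, action):
        # The action indexes a cell in the flattened grid.
        r, c = divmod(int(action), self.size)
        reward = 0.0
        if self.grid[r, c] == 0:
            neighbors = self.grid[max(r - 1, 0):r + 2, max(c - 1, 0):c + 2].sum()
            if neighbors > 0:              # only allow growth adjacent to the settlement
                self.grid[r, c] = 1
                reward = float(neighbors)  # compactness proxy
        self.steps += 1
        done = self.steps >= self.budget
        return self.grid.copy(), reward, done, {}
```

A DQN or PPO agent would then map the occupancy grid to a distribution over the size × size candidate cells and learn which aggregation sequences maximize the chosen objective.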

Item Open Access
Improving experience replay architecture with K-Means clustering (IEEE - Institute of Electrical and Electronics Engineers, 2023-08-28) Serbest, S.; Taşbaş, A. S.; Şahin, Safa Onur
Replay memory highly affects the performance of deep reinforcement learning algorithms in terms of data efficiency and training time. How experiences should be stored in the memory and how sampling should be realized are subjects of ongoing research in the field. In this paper, a new replay memory module, called K-Means Replay Memory, is designed. The module consists of two submodules called Recent Memory and Global Memory. New experiences are inserted only into recent memory, and when the number of experiences in recent memory exceeds a certain limit, experiences are transferred from recent memory to global memory. After this transfer, similarity sets are constituted via the K-Means clustering algorithm within the stored experiences. While sampling, the distribution of the experiences sampled from recent memory with respect to the similarity sets and the average losses obtained from the neural networks are taken into account in order to compute set probabilities; experiences are then sampled from global memory using these probabilities. Experiments are performed using Prioritized Experience Replay, Uniform Experience Replay, and K-Means Replay Memory, and the obtained results are reported in this paper.

Item Open Access
Improving the performance of Batch-Constrained reinforcement learning in continuous action domains via generative adversarial networks (IEEE, 2022-08-29) Sağlam, Baturay; Dalmaz, Onat; Gönç, Kaan; Kozat, Süleyman S.
The Batch-Constrained Q-learning (BCQ) algorithm is shown to overcome the extrapolation error and enable deep reinforcement learning agents to learn from a previously collected fixed batch of transitions. However, due to the conditional Variational Autoencoder (VAE) used in its data generation module, the BCQ algorithm optimizes a variational lower bound and hence is not generalizable to environments with large state and action spaces. In this paper, we show that the performance of the BCQ algorithm can be further improved by employing one of the recent advances in deep learning, Generative Adversarial Networks. Our extensive set of experiments shows that the introduced approach significantly improves BCQ in all of the control tasks tested. Moreover, the introduced approach demonstrates robust generalizability to environments with large state and action spaces in the OpenAI Gym control suite.
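
The core change described above, replacing BCQ's conditional VAE with an adversarially trained action generator, can be sketched as follows. This is a simplified PyTorch illustration with hypothetical network sizes and training loop; it is not the paper's exact architecture or objective.

```python
import torch
import torch.nn as nn

class ActionGenerator(nn.Module):
    """Maps (state, noise) to a candidate action meant to lie on the batch's action manifold."""
    def __init__(self, state_dim, action_dim, noise_dim=8, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + noise_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh())
        self.noise_dim, self.max_action = noise_dim, max_action

    def forward(self, state):
        z = torch.randn(state.shape[0], self.noise_dim, device=state.device)
        return self.max_action * self.net(torch.cat([state, z], dim=1))

class ActionDiscriminator(nn.Module):
    """Scores whether a (state, action) pair looks like it came from the fixed batch."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=1))

def gan_step(gen, disc, gen_opt, disc_opt, state, batch_action):
    bce = nn.BCEWithLogitsLoss()
    # Discriminator: actions from the fixed batch are "real", generated actions are "fake".
    fake_action = gen(state).detach()
    real_logits = disc(state, batch_action)
    fake_logits = disc(state, fake_action)
    d_loss = bce(real_logits, torch.ones_like(real_logits)) + \
             bce(fake_logits, torch.zeros_like(fake_logits))
    disc_opt.zero_grad(); d_loss.backward(); disc_opt.step()
    # Generator: fool the discriminator so its samples stay close to the batch distribution.
    g_logits = disc(state, gen(state))
    g_loss = bce(g_logits, torch.ones_like(g_logits))
    gen_opt.zero_grad(); g_loss.backward(); gen_opt.step()
```

In a BCQ-style agent, such a generator would propose candidate actions near the behavior data for the Q-network to rank, in place of the VAE sampler.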

Item Open Access
An intrinsic motivation based artificial goal generation in on-policy continuous control (IEEE, 2022-08-29) Sağlam, Baturay; Mutlu, Furkan B.; Gönç, Kaan; Dalmaz, Onat; Kozat, Süleyman S.
This work adapts the existing theories on animal motivational systems into the reinforcement learning (RL) paradigm to constitute a directed exploration strategy in on-policy continuous control. We introduce a novel and scalable artificial bonus reward rule that encourages agents to visit useful state spaces. By unifying the intrinsic incentives in the reinforcement learning paradigm under the introduced deterministic reward rule, our method forces the value function to learn the values of unseen or less-known states and prevents premature behavior before the environment is sufficiently learned. The simulation results show that the proposed algorithm considerably improves state-of-the-art on-policy methods and enhances their inherent entropy-based exploration.

Item Open Access
Novel deep reinforcement learning algorithms for continuous control (2023-06) Sağlam, Baturay
Continuous control deep reinforcement learning (RL) algorithms are capable of learning complex and high-dimensional policies directly from raw sensory inputs. However, they often face challenges related to sample efficiency and exploration, which limit their practicality for real-world applications. In light of this, we introduce two novel techniques that enhance the performance of continuous control deep RL algorithms by refining their experience replay and exploration mechanisms. The first technique introduces a novel framework for sampling experiences in actor-critic methods. Specifically designed to stabilize learning and prevent the divergence caused by Prioritized Experience Replay (PER), our framework effectively trains both actor and critic networks by striking a balance between temporal-difference error and policy gradient. Through both theoretical analysis and empirical investigations, we demonstrate that our framework is effective in improving the performance of continuous control deep RL algorithms. The second technique encompasses a directed exploration strategy that relies on intrinsic motivation. Drawing inspiration from established theories on animal motivational systems and adapting them to the actor-critic setting, our strategy showcases its effectiveness by generating exploratory behaviors that are both informative and diverse. It achieves this by maximizing the error of the value function and unifying the existing intrinsic exploration objectives in the literature. We evaluate the presented methods on various continuous control benchmarks and demonstrate that they outperform state-of-the-art methods while achieving new levels of performance in deep RL.
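
A common thread in the intrinsic-motivation entries above is augmenting the extrinsic reward with a bonus tied to how wrong the agent's value estimate is for the visited state. The snippet below is a generic, hypothetical sketch of such a bonus, using a one-step TD-error magnitude scaled by a coefficient; it is not the exact reward rule from these papers.

```python
import torch

def intrinsic_bonus(critic, state, action, reward, next_state, next_action,
                    gamma=0.99, beta=0.1):
    """Generic value-error-based exploration bonus (illustrative, not the papers' exact rule).

    critic: a network mapping (state, action) batches to Q-value estimates.
    beta:   scales how strongly value error is rewarded.
    """
    with torch.no_grad():
        td_target = reward + gamma * critic(next_state, next_action)
        td_error = (td_target - critic(state, action)).abs()
    # The agent is paid extra where its value estimate is poor,
    # steering exploration toward unseen or less-known states.
    return reward + beta * td_error
```

During rollouts, the augmented reward replaces the environment reward in the learning update, so states with poorly learned values are visited more often until the value error shrinks.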

Item Open Access
Novel experience replay mechanisms to improve the performance of the deep deterministic policy gradients algorithms (2022-09) Çiçek, Doğan Can
The experience replay mechanism allows agents to use their experiences multiple times. In prior works, the sampling probability of the transitions was adjusted according to their importance. Reassigning sampling probabilities for every transition in the replay buffer after each iteration is highly inefficient; therefore, experience replay prioritization algorithms recalculate the significance of a transition only when it is sampled, to gain computational efficiency. However, the importance level of the transitions changes dynamically as the policy and the value function of the agent are updated. In addition, experience replay stores transitions generated by the previous policies of the agent, which may deviate significantly from its most recent policy. Higher deviation from the most recent policy leads to more off-policy updates, which is detrimental to the agent. In this thesis, we develop a novel algorithm, Batch Prioritizing Experience Replay via KL Divergence (KLPER), which prioritizes a batch of transitions rather than directly prioritizing each transition. Moreover, to reduce the off-policyness of the updates, our algorithm selects one batch among a certain number of candidate batches and forces the agent to learn through the batch that is most likely to have been generated by its most recent policy. Also, previous experience replay algorithms in the literature provide the same batches of transitions to the actor and critic networks of deep deterministic policy gradient algorithms. However, the learning principles of these two cascaded components differ in terms of their parameter updating strategies. Due to this fact, we attempt to decouple the training of the actor and the critic of deep deterministic policy gradient algorithms in terms of the batches of transitions they use during training. We develop a second algorithm, Decoupled Prioritized Experience Replay (DPER), which enables the agent to use independently sampled batches of transitions for the actor and the critic. DPER utilizes Prioritized Experience Replay (PER) and Batch Prioritizing Experience Replay via KL Divergence (KLPER) to decouple the learning processes of the critic and the actor, respectively. We combine our algorithms, KLPER and DPER, with the state-of-the-art deep deterministic policy gradient algorithms, DDPG and TD3, and evaluate them on continuous control tasks. KLPER provides promising improvements for deep deterministic continuous control algorithms in terms of sample efficiency, final performance, and stability of the policy during training. Moreover, DPER outperforms PER, KLPER, and Vanilla Experience Replay on most of the continuous control tasks, without adding a significant amount of computational complexity over conventional experience replay strategies.
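
The batch-selection idea behind KLPER can be illustrated with a small helper that draws several candidate batches and keeps the one whose stored actions look most like what the current policy would produce. This is a hypothetical sketch under a Gaussian behavior-noise assumption (log-likelihood as a divergence proxy), not the thesis's exact procedure.

```python
import torch

def select_on_policy_batch(policy, batches, action_std=0.1):
    """Pick, among candidate batches, the one most consistent with the current policy.

    policy:  deterministic actor mapping states to actions.
    batches: list of (states, actions) tensor pairs sampled from the replay buffer.
    Assumes stored actions ~ N(policy(state), action_std^2) and compares batches
    by the resulting average Gaussian log-likelihood (a stand-in for a KL-based score).
    """
    best_batch, best_score = None, -float("inf")
    for states, actions in batches:
        with torch.no_grad():
            mean = policy(states)
        # Average log-likelihood of stored actions under the current policy.
        log_prob = -((actions - mean) ** 2).sum(dim=-1) / (2 * action_std ** 2)
        score = log_prob.mean().item()
        if score > best_score:
            best_batch, best_score = (states, actions), score
    return best_batch
```

The selected batch then feeds the actor update, while the critic can keep learning from prioritized batches, mirroring the decoupling idea of DPER.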

Item Open Access
Pekiştirmeli öğrenme algoritmalarının DeepRTS oyunu üzerinde performans karşılaştırması [Performance comparison of reinforcement learning algorithms on the DeepRTS game] (IEEE, 2021-06-11) Şahin, Safa Onur; Yücesoy, Veysel
In this paper, (i) an environment based on macro actions is built for the DeepRTS game, which was developed for learning with artificial intelligence, and (ii) selected reinforcement learning algorithms are modified as needed, trained in this environment, and analyzed in terms of performance. The DeepRTS game was modified in order to draw parallels with real-life planning and to allow a real-time strategy game to be played on relatively modest hardware. First, a set of macro actions was prepared, and the agents were restricted to taking actions only from this set. Second, in line with real-life planning, the system was extended so that actions can be taken for every unit that can receive a command at any given moment. Combined with the fact that macro actions span different numbers of time steps, this allows multiple actions to start and multiple actions to finish at any moment, which departs somewhat from the classical reinforcement learning problem; it also adds a new dimension to the well-known credit assignment problem and makes it more complex. Offensive, defensive, and random rule-based agents were created for training purposes and were used in rotation as opponent agents during the training of the reinforcement-learning-based agents. The performance of the trained agents against each other, playing as player 1 and player 2, is reported.

Item Open Access
Unified intrinsically motivated exploration for off-policy learning in continuous action spaces (IEEE, 2022-08-29) Sağlam, Baturay; Mutlu, Furkan B.; Dalmaz, Onat; Kozat, Süleyman S.
Exploration is maintained in continuous control using undirected methods, in which random noise perturbs the network parameters or the selected actions. Intrinsically driven exploration is a good alternative to undirected techniques, but it has only been studied for discrete action domains. In this study, the intrinsic incentives in the existing reinforcement learning literature are unified under a deterministic artificial goal-generation rule for off-policy learning. Through this practice, the agent gains additional reward if it chooses actions that lead it to useful state spaces. An extensive set of experiments indicates that the introduced artificial reward rule significantly improves the performance of the off-policy baseline algorithms.

Item Open Access
Visual object tracking in drone images with deep reinforcement learning (IEEE, 2021-05-05) Gözen, Derya; Özer, Sedat
There is an increasing demand for camera-equipped drones and their applications in many domains, varying from agriculture to entertainment and from sports events to surveillance. In such drone applications, an essential and common task is visually tracking an object of interest. Drone (or UAV) images have different properties than ground-taken (natural) images, and those differences introduce additional complexities for existing object trackers when they are applied directly to drone footage. Important differences include (i) the smaller sizes of the objects to be tracked and (ii) different orientations and viewing angles, which yield different textures and features. Therefore, new algorithms trained on drone images are needed for drone-based applications. In this paper, we introduce a deep reinforcement learning (RL) based single-object tracker that tracks an object of interest in drone images by estimating a series of actions to find the location of the object in the next frame. This is the first work introducing a single-object tracker that uses a deep RL-based technique for drone images. Our proposed solution introduces a novel reward function that aims to reduce the total number of actions taken to estimate the object's location in the next frame, as well as a different backbone network to be used on low-resolution images. Additionally, we introduce a set of new actions into the action library to better deal with the above-mentioned complexities. We compare our proposed solutions to a state-of-the-art tracking algorithm from the recent literature and demonstrate up to 3.87% improvement in precision and 3.6% improvement in IoU values on the VisDrone2019 data set. We also provide additional results on the OTB-100 data set and show up to 3.15% improvement in precision when compared to the same previous state-of-the-art algorithm. Lastly, we analyze the ability of our proposed solutions to handle some of the challenges faced during tracking, including but not limited to occlusion, deformation, and scale variation.
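
Action-based RL trackers of the kind described in the last entry pose tracking as sequentially nudging a bounding box until it covers the target in the next frame. The snippet below is a hypothetical illustration of such a discrete action set and an IoU-based reward that also penalizes long action sequences; it is not the paper's exact action library or reward function.

```python
def apply_action(box, action, delta=0.05):
    """Move or rescale a bounding box (x, y, w, h) by one discrete action.
    Hypothetical action set; an RL tracker learns to pick such actions per frame."""
    x, y, w, h = box
    moves = {
        "left":    (x - delta * w, y, w, h),
        "right":   (x + delta * w, y, w, h),
        "up":      (x, y - delta * h, w, h),
        "down":    (x, y + delta * h, w, h),
        "bigger":  (x, y, w * (1 + delta), h * (1 + delta)),
        "smaller": (x, y, w * (1 - delta), h * (1 - delta)),
        "stop":    (x, y, w, h),
    }
    return moves[action]

def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax2, ay2, bx2, by2 = a[0] + a[2], a[1] + a[3], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    return inter / (a[2] * a[3] + b[2] * b[3] - inter + 1e-9)

def step_reward(prev_box, new_box, gt_box, step_penalty=0.01):
    # Reward IoU improvement; the per-step penalty encourages short action sequences,
    # in the spirit of a reward that limits the number of actions per frame.
    return iou(new_box, gt_box) - iou(prev_box, gt_box) - step_penalty
```

At inference time, the policy network repeatedly picks an action from the set above, applies it to the current box, and stops once the "stop" action is selected or a step budget is exhausted.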