Browsing by Subject "Continuous control"
Now showing 1 - 5 of 5
Item (Open Access)
Deep intrinsically motivated exploration in continuous control (Springer, 2023-10-26)
Sağlam, Baturay; Kozat, Süleyman Serdar
In continuous control, exploration is often performed through undirected strategies in which the parameters of the networks or the selected actions are perturbed by random noise. Although the deep setting of undirected exploration has been shown to improve the performance of on-policy methods, it introduces excessive computational complexity and is known to fail in the off-policy setting. Intrinsically motivated exploration is an effective alternative to undirected strategies, but it is usually studied only for discrete action domains. In this paper, we investigate how intrinsic motivation can effectively be combined with deep reinforcement learning in the control of continuous systems to obtain a directed exploratory behavior. We adapt existing theories on animal motivational systems to the reinforcement learning paradigm and introduce a novel and scalable directed exploration strategy. The introduced approach, motivated by the maximization of the value function's error, extracts useful information from the collected set of experiences and unifies the intrinsic exploration motivations in the literature under a single exploration objective. An extensive set of empirical studies demonstrates that our framework extends to larger and more diverse state spaces, dramatically improves the baselines, and significantly outperforms the undirected strategies.

Item (Open Access)
An intrinsic motivation based artificial goal generation in on-policy continuous control (IEEE, 2022-08-29)
Sağlam, Baturay; Mutlu, Furkan B.; Gönç, Kaan; Dalmaz, Onat; Kozat, Süleyman S.
This work adapts existing theories on animal motivational systems to the reinforcement learning (RL) paradigm to constitute a directed exploration strategy in on-policy continuous control. We introduce a novel and scalable artificial bonus reward rule that encourages agents to visit useful state spaces. By unifying the intrinsic incentives in the reinforcement learning paradigm under the introduced deterministic reward rule, our method forces the value function to learn the values of unseen or less-known states and prevents premature behavior before the environment has been sufficiently learned. The simulation results show that the proposed algorithm considerably improves state-of-the-art on-policy methods and enhances their inherent entropy-based exploration.
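The two items above motivate exploration with an intrinsic bonus that is added to the environment reward and driven by the value function's error. The snippet below is only a minimal sketch of that general shape; the exact bonus rule, the scaling factor beta, and the function names are illustrative assumptions, not the published formulation.

```python
def intrinsic_bonus(reward, v_estimate, v_next_estimate, gamma=0.99, beta=0.1):
    """Hypothetical bonus: scaled magnitude of the one-step value (TD) error."""
    td_error = reward + gamma * v_next_estimate - v_estimate
    return beta * abs(td_error)

def shaped_reward(reward, v_estimate, v_next_estimate):
    """Total reward the agent trains on: extrinsic reward plus intrinsic bonus."""
    return reward + intrinsic_bonus(reward, v_estimate, v_next_estimate)

# A transition whose value the critic predicts poorly receives a larger bonus,
# nudging the agent toward less-known regions of the state space.
print(shaped_reward(reward=1.0, v_estimate=2.0, v_next_estimate=3.5))
```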
Item (Open Access)
Novel deep reinforcement learning algorithms for continuous control (2023-06)
Sağlam, Baturay
Continuous control deep reinforcement learning (RL) algorithms are capable of learning complex and high-dimensional policies directly from raw sensory inputs. However, they often face challenges related to sample efficiency and exploration, which limit their practicality for real-world applications. In light of this, we introduce two novel techniques that enhance the performance of continuous control deep RL algorithms by refining their experience replay and exploration mechanisms. The first technique introduces a novel framework for sampling experiences in actor-critic methods. Specifically designed to stabilize training and prevent the divergence caused by Prioritized Experience Replay (PER), our framework effectively trains both actor and critic networks by striking a balance between temporal-difference error and policy gradient. Through both theoretical analysis and empirical investigations, we demonstrate that our framework is effective in improving the performance of continuous control deep RL algorithms. The second technique is a directed exploration strategy that relies on intrinsic motivation. Drawing inspiration from established theories on animal motivational systems and adapting them to the actor-critic setting, our strategy generates exploratory behaviors that are both informative and diverse. It achieves this by maximizing the error of the value function and unifying the existing intrinsic exploration objectives in the literature. We evaluate the presented methods on various continuous control benchmarks and demonstrate that they outperform state-of-the-art methods while achieving new levels of performance in deep RL.
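The first technique in the thesis above samples replay transitions so that they serve both the critic and the actor, balancing temporal-difference error against the policy gradient. The following is a minimal sketch of one way such a blended priority could be formed; it is not the thesis's actual framework, and the mixing weight `lam` and the `actor_scores` proxy are assumptions for illustration.

```python
import numpy as np

def mixed_priorities(td_errors, actor_scores, lam=0.5, eps=1e-6):
    """Blend critic-oriented and actor-oriented importance into sampling priorities."""
    td = np.abs(np.asarray(td_errors))
    ac = np.abs(np.asarray(actor_scores))   # per-sample policy-side importance proxy
    prio = lam * td / (td.sum() + eps) + (1.0 - lam) * ac / (ac.sum() + eps)
    return prio / prio.sum()                # normalized sampling probabilities

rng = np.random.default_rng(0)
td_errors = rng.normal(size=1000)           # stand-ins for real per-transition errors
actor_scores = rng.normal(size=1000)
probs = mixed_priorities(td_errors, actor_scores)
batch_idx = rng.choice(len(probs), size=256, p=probs)   # draw a prioritized batch
```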
Item (Open Access)
Novel experience replay mechanisms to improve the performance of the deep deterministic policy gradients algorithms (2022-09)
Çiçek, Doğan Can
The experience replay mechanism allows agents to reuse their experiences multiple times. In prior works, the sampling probability of each transition was adjusted according to its importance. Reassigning sampling probabilities for every transition in the replay buffer after each iteration is highly inefficient, so experience replay prioritization algorithms recalculate the significance of a transition only when it is sampled. However, the importance of a transition changes dynamically as the agent's policy and value function are updated. In addition, the replay buffer stores transitions generated by previous policies that may deviate significantly from the agent's most recent policy, and a larger deviation leads to more off-policy updates, which is detrimental to the agent. In this thesis, we develop a novel algorithm, Batch Prioritizing Experience Replay via KL Divergence (KLPER), which prioritizes a batch of transitions rather than each transition individually. Moreover, to reduce the off-policyness of the updates, our algorithm selects one batch among a certain number of candidate batches and forces the agent to learn from the batch that is most likely to have been generated by its most recent policy. Furthermore, previous experience replay algorithms provide the same batches of transitions to the Actor and the Critic networks of deep deterministic policy gradient algorithms, even though these two cascaded components follow different parameter-updating strategies. We therefore decouple the training of the Actor and the Critic with respect to the batches of transitions they use, and develop a novel algorithm, Decoupled Prioritized Experience Replay (DPER), that lets the agent use independently sampled batches of transitions for the Actor and the Critic. DPER utilizes Prioritized Experience Replay (PER) and KLPER to decouple the learning processes of the Critic and the Actor, respectively. We combine our algorithms, KLPER and DPER, with the state-of-the-art deep deterministic policy gradient algorithms DDPG and TD3, and evaluate them on continuous control tasks. KLPER provides promising improvements for deep deterministic continuous control algorithms in terms of sample efficiency, final performance, and stability of the policy during training. Moreover, DPER outperforms PER, KLPER, and vanilla experience replay on most of the continuous control tasks, without adding a significant amount of computational complexity. (A rough sketch of the KL-based batch selection idea appears after the final item below.)

Item (Open Access)
Unified intrinsically motivated exploration for off-policy learning in continuous action spaces (IEEE, 2022-08-29)
Sağlam, Baturay; Mutlu, Furkan B.; Dalmaz, Onat; Kozat, Süleyman S.
Exploration in continuous control is typically maintained using undirected methods, in which random noise perturbs the network parameters or the selected actions. Intrinsically driven exploration is a good alternative to undirected techniques; however, it has only been studied for discrete action domains. In this study, the intrinsic incentives in the existing reinforcement learning literature are unified under a deterministic artificial goal generation rule for off-policy learning. The agent gains additional reward under this rule if it chooses actions that lead it to useful state spaces. An extensive set of experiments indicates that the introduced artificial reward rule significantly improves the performance of the off-policy baseline algorithms.
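As a rough illustration of the batch-selection idea behind KLPER (the 2022-09 thesis above), the sketch below scores candidate batches by a KL-style closeness measure between the stored actions and the current policy's actions at the same states. It assumes a deterministic policy with fixed Gaussian exploration noise, under which the KL divergence reduces to a scaled squared action distance; the thesis's exact criterion is not reproduced here, and all names below are hypothetical.

```python
import numpy as np

def batch_kl_proxy(stored_actions, current_actions, sigma=0.1):
    """Mean KL between N(stored, sigma^2 I) and N(current, sigma^2 I) over a batch."""
    sq_dist = np.sum((stored_actions - current_actions) ** 2, axis=-1)
    return float(np.mean(sq_dist)) / (2.0 * sigma ** 2)

def select_batch(candidate_batches, policy):
    """Pick the candidate batch whose actions look most like the current policy's."""
    scores = [batch_kl_proxy(actions, policy(states))
              for states, actions in candidate_batches]
    return candidate_batches[int(np.argmin(scores))]    # smallest divergence wins

# Toy usage with a fixed linear-tanh policy and two random candidate batches.
rng = np.random.default_rng(1)
W = rng.normal(size=(4, 2))
policy = lambda s: np.tanh(s @ W)
candidates = [(rng.normal(size=(64, 4)), rng.normal(size=(64, 2))) for _ in range(2)]
states, actions = select_batch(candidates, policy)
```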