Novel experience replay mechanisms to improve the performance of deep deterministic policy gradient algorithms
Abstract
The experience replay mechanism allows agents to reuse their experiences multiple times. In prior work, the sampling probabilities of transitions are adjusted according to their importance. Since reassigning a sampling probability to every transition in the replay buffer after each iteration is highly inefficient, experience replay prioritization algorithms recalculate the significance of a transition only when that transition is sampled. However, the importance of a transition changes dynamically as the agent's policy and value function are updated. In addition, the replay buffer stores transitions generated by the agent's earlier policies, which may deviate significantly from its most recent policy. A larger deviation from the most recent policy leads to more off-policy updates, which is detrimental to the agent. In this thesis, we develop a novel algorithm, Batch Prioritizing Experience Replay via KL Divergence (KLPER), which prioritizes a batch of transitions rather than each transition individually. To reduce the off-policyness of the updates, KLPER samples a number of candidate batches and trains the agent with the batch that is most likely to have been generated by the agent's most recent policy.

Furthermore, previous experience replay algorithms in the literature provide the same batches of transitions to the Actor and the Critic networks of Deep Deterministic Policy Gradient algorithms. However, these two cascaded components update their parameters according to different learning principles. We therefore decouple the training of the Actor and the Critic with respect to the batches of transitions they use. We develop a second algorithm, Decoupled Prioritized Experience Replay (DPER), which enables the agent to use independently sampled batches of transitions for the Actor and the Critic. DPER employs Prioritized Experience Replay (PER) for the Critic and KLPER for the Actor, decoupling their learning processes. We combine KLPER and DPER with current state-of-the-art Deep Deterministic Policy Gradient algorithms, DDPG and TD3, and evaluate them on continuous control tasks. KLPER provides promising improvements in sample efficiency, final performance, and stability of the policy during training. Moreover, DPER outperforms PER, KLPER, and Vanilla Experience Replay on most of the continuous control tasks, without adding a significant amount of computational complexity.
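The batch-selection idea behind KLPER can be illustrated with a short sketch. The snippet below is a minimal illustration under stated assumptions, not the thesis implementation: it assumes the replay buffer is exposed as plain `states` and `actions` arrays, the policy is a deterministic callable, and the KL divergence is approximated by treating the behavior and current policies as fixed-variance Gaussians, so it reduces to a scaled squared action distance. All names and parameters here are hypothetical.

```python
import numpy as np

def select_onpolicy_batch(states, actions, policy, batch_size=256,
                          num_candidates=10, sigma=0.2, rng=None):
    """Sample several candidate batches and return the indices of the one whose
    stored actions are closest, in an approximate KL sense, to the actions the
    current deterministic policy would take in the same states.

    Assumption: both policies are modeled as isotropic Gaussians with shared,
    fixed std `sigma`, so the KL term collapses to a scaled squared distance.
    """
    rng = rng or np.random.default_rng()
    best_idx, best_div = None, np.inf
    for _ in range(num_candidates):
        idx = rng.integers(0, len(states), size=batch_size)
        mu_old = actions[idx]             # actions recorded by earlier policies
        mu_new = policy(states[idx])      # actions under the most recent policy
        # KL(N(mu_old, sigma^2 I) || N(mu_new, sigma^2 I)) = ||mu_old - mu_new||^2 / (2 sigma^2)
        div = np.mean(np.sum((mu_old - mu_new) ** 2, axis=1)) / (2.0 * sigma ** 2)
        if div < best_div:
            best_idx, best_div = idx, div
    return best_idx
```

In a DPER-style setup, the Critic would instead draw its batch from a prioritized (PER) sampler while the Actor learns from the batch selected above, so the two networks are updated from independently sampled batches.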