Experience replay strategies for improving performance of deep off-policy actor-critic reinforcement learning algorithms

Date

2025-07

Advisor

Kozat, Süleyman Serdar

Abstract

We investigate an important conflict in deep deterministic policy gradient algorithms: experience replay strategies designed to accelerate critic learning can destabilize the actor. Conventional methods, including Prioritized Experience Replay, sample a single batch of transitions to update both networks. This shared-data approach ignores the fact that transitions with high temporal-difference error, while beneficial for the critic's value-function estimation, may correspond to off-policy actions that introduce misleading gradients and degrade the actor's policy. To resolve this, we introduce Decoupled Prioritized Experience Replay, a novel framework that explicitly separates transition sampling for the actor and the critic to serve their distinct learning objectives. For the critic, it employs a conventional prioritization scheme, sampling transitions with high temporal-difference error to promote efficient learning of the value function. For the actor, Decoupled Prioritized Experience Replay introduces a new sampling strategy: it selects batches that are more on-policy by minimizing the Kullback-Leibler divergence between the actions stored in the buffer and those proposed by the current policy. We integrate Decoupled Prioritized Experience Replay with the state-of-the-art Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm and evaluate it on six standard continuous control benchmarks from OpenAI Gym and MuJoCo. The results show that Decoupled Prioritized Experience Replay consistently accelerates learning and achieves superior final performance compared to both vanilla and prioritized replay. More critically, Decoupled Prioritized Experience Replay maintains learning stability and converges to strong policies in tasks where standard prioritized replay fails to learn. Further ablation studies indicate that the decoupling mechanism is an important factor in this robustness and that the benefits of Decoupled Prioritized Experience Replay are achievable with a computationally inexpensive search, making it a practically effective solution for improving off-policy learning.
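
The abstract describes the decoupled sampling only at a high level. The Python sketch below illustrates one way such a buffer could be organized, assuming a deterministic policy callable that maps states to actions; the class and parameter names (DecoupledReplayBuffer, alpha, n_candidates), the candidate-batch search, and the squared-distance proxy used in place of the KL criterion are illustrative assumptions, not the implementation from the thesis.

import numpy as np

class DecoupledReplayBuffer:
    # Sketch of a replay buffer that serves the critic and the actor
    # with separately chosen batches, as described in the abstract.
    def __init__(self, capacity, alpha=0.6):
        self.capacity = capacity
        self.alpha = alpha          # priority exponent for critic-side sampling (assumed value)
        self.storage = []           # transitions: (state, action, reward, next_state, done)
        self.priorities = []        # |TD error|-based priorities
        self.pos = 0

    def add(self, transition, td_error=1.0):
        priority = (abs(td_error) + 1e-6) ** self.alpha
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
            self.priorities.append(priority)
        else:
            self.storage[self.pos] = transition
            self.priorities[self.pos] = priority
        self.pos = (self.pos + 1) % self.capacity

    def sample_critic(self, batch_size):
        # Conventional prioritized sampling: transitions with large
        # |TD error| are drawn more often to speed up value learning.
        probs = np.array(self.priorities)
        probs /= probs.sum()
        idx = np.random.choice(len(self.storage), batch_size, p=probs)
        return [self.storage[i] for i in idx], idx

    def sample_actor(self, batch_size, policy, n_candidates=4):
        # "More on-policy" sampling for the actor: among a few uniformly
        # drawn candidate batches, keep the one whose stored actions are
        # closest to what the current policy would take in those states.
        # Mean squared distance stands in here for the KL-based criterion.
        best_batch, best_score = None, np.inf
        for _ in range(n_candidates):
            idx = np.random.randint(0, len(self.storage), batch_size)
            batch = [self.storage[i] for i in idx]
            states = np.array([t[0] for t in batch])
            actions = np.array([t[1] for t in batch])
            score = np.mean((policy(states) - actions) ** 2)
            if score < best_score:
                best_batch, best_score = batch, score
        return best_batch

In a TD3-style training loop, sample_critic would feed the twin-critic update (with priorities refreshed from the new TD errors) and sample_actor would feed the delayed policy update; searching over only a handful of candidate batches keeps the overhead small, consistent with the computationally inexpensive search mentioned in the abstract.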

Degree Discipline

Electrical and Electronic Engineering

Degree Level

Master's

Degree Name

MS (Master of Science)

Language

English

Type