TY - JOUR
T1 - Policy gradients with memory-augmented critic
T2 - Stabilizing off-policy policy gradients via differentiable memory
AU - Seno, Takuma
AU - Imai, Michita
N1 - Publisher Copyright:
© 2021, Japanese Society for Artificial Intelligence. All rights reserved.
PY - 2021
Y1 - 2021
N2 - Deep reinforcement learning has been investigated in high-dimensional continuous control tasks. Deep Deterministic Policy Gradients (DDPG) is known as a highly sample-efficient policy gradient algorithm. However, DDPG is reported to be unstable during training due to bias and variance problems in learning its action-value function. In this paper, we propose Policy Gradients with Memory Augmented Critic (PGMAC), which builds an action-value function with the memory module previously proposed as the Differentiable Neural Dictionary (DND). Although the DND has only been studied in discrete action-space problems, we propose the Action-Concatenated Key, a technique for combining DDPG-based policy gradient methods with the DND. Furthermore, we show a remarkable advantage of PGMAC: the long-term reward calculation and the weighted summation of value estimates in the DND provide an essential mechanism for mitigating the bias and variance problems. In experiments, PGMAC significantly outperformed baselines on continuous control tasks. The effects of hyperparameters were also investigated, showing that the memory-augmented action-value function reduces bias and variance in policy optimization.
AB - Deep reinforcement learning has been investigated in high-dimensional continuous control tasks. Deep Deterministic Policy Gradients (DDPG) is known as a highly sample-efficient policy gradient algorithm. However, DDPG is reported to be unstable during training due to bias and variance problems in learning its action-value function. In this paper, we propose Policy Gradients with Memory Augmented Critic (PGMAC), which builds an action-value function with the memory module previously proposed as the Differentiable Neural Dictionary (DND). Although the DND has only been studied in discrete action-space problems, we propose the Action-Concatenated Key, a technique for combining DDPG-based policy gradient methods with the DND. Furthermore, we show a remarkable advantage of PGMAC: the long-term reward calculation and the weighted summation of value estimates in the DND provide an essential mechanism for mitigating the bias and variance problems. In experiments, PGMAC significantly outperformed baselines on continuous control tasks. The effects of hyperparameters were also investigated, showing that the memory-augmented action-value function reduces bias and variance in policy optimization.
KW - Continuous control
KW - Deep reinforcement learning
KW - Memory module
KW - Policy gradients
UR - http://www.scopus.com/inward/record.url?scp=85099153133&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85099153133&partnerID=8YFLogxK
U2 - 10.1527/tjsai.36-1_B-K71
DO - 10.1527/tjsai.36-1_B-K71
M3 - Article
AN - SCOPUS:85099153133
SN - 1346-0714
VL - 36
SP - 1
EP - 8
JO - Transactions of the Japanese Society for Artificial Intelligence
JF - Transactions of the Japanese Society for Artificial Intelligence
IS - 1
M1 - B-K71_1-8
ER -