Twin Delayed Deep Deterministic Policy Gradient (TD3)
Overview
TD3 is a popular DRL algorithm for continuous control. It extends DDPG with three techniques: 1) Clipped Double Q-Learning, 2) Delayed Policy Updates, and 3) Target Policy Smoothing Regularization. With these three techniques TD3 shows significantly better performance compared to DDPG.
Original paper:
Reference resources:
Implemented Variants
Variants Implemented | Description |
---|---|
td3_continuous_action.py , docs |
For continuous action space |
Below are our single-file implementations of TD3:
td3_continuous_action.py
The td3_continuous_action.py has the following features:
- For continuous action space
- Works with the
Box
observation space of low-level features - Works with the
Box
(continuous) action space
Usage
poetry install
poetry install -E pybullet
python cleanrl/td3_continuous_action.py --help
python cleanrl/td3_continuous_action.py --env-id HopperBulletEnv-v0
poetry install -E mujoco # only works in Linux
python cleanrl/td3_continuous_action.py --env-id Hopper-v3
Explanation of the logged metrics
Running python cleanrl/td3_continuous_action.py
will automatically record various metrics such as various losses in Tensorboard. Below are the documentation for these metrics:
charts/episodic_return
: episodic return of the gamecharts/SPS
: number of steps per secondlosses/qf1_loss
: the MSE between the Q values at timestep \(t\) and the target Q values at timestep \(t+1\), which minimizes temporal difference.losses/actor_loss
: implemented as-qf1(data.observations, actor(data.observations)).mean()
; it is the negative average Q values calculated based on the 1) observations and the 2) actions computed by the actor based on these observations. By minimizingactor_loss
, the optimizer updates the actors parameter using the following gradient (Fujimoto et al., 2018, Algorithm 1)2:
losses/qf1_values
: implemented as `qf1(data.observations, data.actions).view(-1); it is the average Q values of the sampled data in the replay buffer; useful when gauging if under or over esitmations happen
Implementation details
Our td3_continuous_action.py
is based on the TD3.py
from sfujim/TD3. Our td3_continuous_action.py
presents the following implementation differences.
td3_continuous_action.py
uses a two separate objectsqf1
andqf2
to represents the two Q functions in the Clipped Double Q-learning architecture, whereasTD3.py
(Fujimoto et al., 2018)2 uses a singleCritic
class that contains both Q networks. That said, these two implementations are virtually the same.td3_continuous_action.py
rescales the gradient so that the norm of the parameters does not exceed0.5
like done in PPO ( ppo2/model.py#L102-L108).
Experiment results
PR vwxyzjn/cleanrl#141 tracks our effort to conduct experiments, and the reprodudction instructions can be found at vwxyzjn/cleanrl/benchmark/td3.
Below are the average episodic returns for td3_continuous_action.py
(3 random seeds). To ensure the quality of the implementation, we compared the results against (Fujimoto et al., 2018)2.
Environment | td3_continuous_action.py |
TD3.py (Fujimoto et al., 2018, Table 1)2 |
---|---|---|
HalfCheetah | 9391.52 ± 448.54 | 9636.95 ± 859.065 |
Walker2d | 3895.80 ± 333.89 | 4682.82 ± 539.64 |
Hopper | 3379.25 ± 200.22 | 3564.07 ± 114.74 |
Info
Note that td3_continuous_action.py
uses gym MuJoCo v2 environments while TD3.py
(Fujimoto et al., 2018)2 uses the gym MuJoCo v1 environments. According to the openai/gym#834, gym MuJoCo v2 environments should be equivalent to the gym MuJoCo v1 environments.
Also note the performance of our td3_continuous_action.py
seems to be worse than the reference implementation on Walker2d. This is likely due to openai/gym#938. We would have a hard time reproducing gym MuJoCo v1 environments because they have been long deprecated.
One other thing could cause the performance difference: the original code reported the average episodic return using determinisitc evaluation (i.e., without exploration noise), see sfujim/TD3/main.py#L15-L32
, whereas we reported the episodic return during training and the policy gets updated between environments steps.
Learning curves:
Tracked experiments and game play videos:
-
Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N.M., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2016). Continuous control with deep reinforcement learning. CoRR, abs/1509.02971. https://arxiv.org/abs/1509.02971 ↩
-
Fujimoto, S., Hoof, H.V., & Meger, D. (2018). Addressing Function Approximation Error in Actor-Critic Methods. ArXiv, abs/1802.09477. https://arxiv.org/abs/1802.09477 ↩↩↩↩↩