Max van Dijck 232c331ce3 [RLlib] Rename all np.product usage to np.prod (#46317) 3 months ago
examples 5e89aa4def [RLlib contrib] TD3. (#36726) 1 year ago
src 5e89aa4def [RLlib contrib] TD3. (#36726) 1 year ago
tests 5e89aa4def [RLlib contrib] TD3. (#36726) 1 year ago
tuned_examples a9ac55d4f2 [RLlib; RLlib contrib] Move `tuned_examples` into rllib_contrib and remove CI learning tests for contrib algos. (#40444) 1 year ago
BUILD a9ac55d4f2 [RLlib; RLlib contrib] Move `tuned_examples` into rllib_contrib and remove CI learning tests for contrib algos. (#40444) 1 year ago
README.md 5e89aa4def [RLlib contrib] TD3. (#36726) 1 year ago
pyproject.toml 232c331ce3 [RLlib] Rename all np.product usage to np.prod (#46317) 3 months ago
requirements.txt 232c331ce3 [RLlib] Rename all np.product usage to np.prod (#46317) 3 months ago

README.md

TD3 (Twin Delayed DDPG)

While DDPG can sometimes achieve great performance, it is frequently brittle with respect to hyperparameters and other kinds of tuning. A common failure mode for DDPG is that the learned Q-function begins to dramatically overestimate Q-values, and the policy then breaks because it exploits the errors in the Q-function. Twin Delayed DDPG (TD3) is an algorithm that addresses this issue by introducing three critical tricks:

Trick One: Clipped Double-Q Learning. TD3 learns two Q-functions instead of one (hence “twin”), and uses the smaller of the two Q-values to form the targets in the Bellman error loss functions.
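
A minimal sketch of the clipped double-Q target for a single transition, in plain Python. The function name, its arguments, and the default discount `gamma=0.99` are assumptions made for illustration; they are not taken from this package's source.

```python
def clipped_double_q_target(reward, done, q1_next, q2_next, gamma=0.99):
    """Bellman target built from the smaller of two target-critic estimates.

    q1_next and q2_next are the two target Q-values for the next state and
    the target action. Names and defaults here are illustrative assumptions.
    """
    # Taking the minimum counteracts overestimation by either critic.
    min_q_next = min(q1_next, q2_next)
    # No bootstrapping past a terminal transition.
    not_done = 0.0 if done else 1.0
    return reward + gamma * not_done * min_q_next


# Both critics regress toward the same target value.
target = clipped_double_q_target(reward=1.0, done=False, q1_next=10.0, q2_next=8.0, gamma=0.5)
```

In this example the target is built from the second critic's estimate (8.0), since it is the smaller of the two.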

Trick Two: “Delayed” Policy Updates. TD3 updates the policy (and target networks) less frequently than the Q-function. The TD3 paper recommends one policy update for every two Q-function updates.
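
The update schedule can be sketched as a simple loop. This is an illustration only: `run_updates`, `soft_update`, `policy_delay`, and `tau` are hypothetical names, the gradient steps are replaced by counters, and the Polyak-style soft target update is an assumption about how target networks are refreshed, not a description of this package's code.

```python
def soft_update(target_params, online_params, tau=0.005):
    """Move each target parameter a fraction tau toward its online value."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


def run_updates(num_steps, policy_delay=2):
    """Count critic and policy updates under a delayed-update schedule."""
    critic_updates = 0
    policy_updates = 0
    for step in range(1, num_steps + 1):
        # The critics (Q-functions) are updated on every step.
        critic_updates += 1
        # The policy and the target networks are updated only every
        # `policy_delay` steps.
        if step % policy_delay == 0:
            policy_updates += 1
    return critic_updates, policy_updates


critic_updates, policy_updates = run_updates(num_steps=10, policy_delay=2)
```

With `policy_delay=2`, ten steps give ten critic updates and five policy updates, matching the one-to-two ratio described above.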

Trick Three: Target Policy Smoothing. TD3 adds noise to the target action, which smooths Q along changes in action and makes it harder for the policy to exploit Q-function errors.
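
A minimal sketch of target policy smoothing for a scalar action, using only the standard library. The function name, the clipped Gaussian noise, and the default values (`noise_std=0.2`, `noise_clip=0.5`, action bounds of -1 and 1) are assumptions chosen for illustration and may differ from this package's defaults.

```python
import random


def smoothed_target_action(target_action, noise_std=0.2, noise_clip=0.5,
                           low=-1.0, high=1.0, rng=random):
    """Add clipped Gaussian noise to a target action, then clip to bounds.

    All names and default values are illustrative assumptions.
    """
    noise = rng.gauss(0.0, noise_std)
    # Bound the noise so the smoothed action stays near the original one.
    noise = max(-noise_clip, min(noise_clip, noise))
    # Keep the result inside the valid action range.
    return max(low, min(high, target_action + noise))


# A seeded generator makes the example reproducible.
action = smoothed_target_action(0.3, rng=random.Random(0))
```

The smoothed action is what gets passed to the target Q-functions when forming the Bellman target, so nearby actions receive similar target values.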

Together, these three tricks result in substantially improved performance over baseline DDPG.

Installation

conda create -n rllib-td3 python=3.10
conda activate rllib-td3
pip install -r requirements.txt
pip install -e '.[development]'

Usage

[TD3 Example]()