# Advantage Actor-Critic (A2C, A3C)

## Overview

Advantage Actor-Critic comprises two distributed, model-free, on-policy RL algorithms: A3C and A2C. Both are distributed versions of the vanilla Policy Gradient (PG) algorithm, differing only in their distributed execution pattern. The paper suggests accelerating training by scaling up data collection: worker nodes carry copies of the central node's policy network and collect data from the environment in parallel. Each worker uses its data to compute gradients; the central node applies these gradients and then sends the updated weights back to the workers.
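
To make the data flow concrete, below is a minimal, framework-agnostic sketch of the worker/central-node exchange described above. `collect_rollout`, `compute_gradients`, and `training_iteration` are hypothetical placeholders for illustration only, not RLlib APIs.

```python
import numpy as np

rng = np.random.default_rng(0)

def collect_rollout(weights):
    # Placeholder for environment interaction with the current policy copy.
    return None

def compute_gradients(weights, rollout):
    # Placeholder for the actor-critic gradient computed from a rollout;
    # here we just return noise of the right shape.
    return rng.normal(scale=0.01, size=weights.shape)

def training_iteration(central_weights, num_workers=2, lr=0.5):
    for _ in range(num_workers):
        # 1. Worker receives a copy of the central node's weights.
        worker_weights = central_weights.copy()
        # 2. Worker collects data and computes a gradient from it.
        rollout = collect_rollout(worker_weights)
        grad = compute_gradients(worker_weights, rollout)
        # 3. Central node applies the gradient and broadcasts new weights.
        central_weights = central_weights - lr * grad
    return central_weights

weights = np.zeros(4)
for _ in range(3):
    weights = training_iteration(weights)
print(weights)
```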

In A2C, the worker nodes collect data synchronously. The collected data is gathered into one large batch, from which the central node (the central policy) computes the gradient update. In A3C, by contrast, the worker nodes generate data asynchronously, compute gradients locally from that data, and send the computed gradients to the central node. Because of this asynchrony, the workers in A3C may be slightly out of sync with the central node, which can bias learning (gradients are computed from slightly stale copies of the policy weights).
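
A minimal usage sketch, assuming the `A2CTrainer`/`A3CTrainer` entry points exposed by `ray.rllib.agents.a3c` (the modules in this directory); switching between the synchronous and asynchronous variant is just a choice of trainer class, and the degree of parallelism is controlled by standard config such as `num_workers`:

```python
import ray
from ray.rllib.agents.a3c import A2CTrainer, A3CTrainer

ray.init()

config = {
    "env": "CartPole-v0",
    "num_workers": 2,      # number of parallel rollout workers
    "framework": "torch",  # or "tf"
}

# Synchronous variant: workers' samples are gathered into one train batch.
trainer = A2CTrainer(config=config)
# Asynchronous variant: workers send gradients to the central learner.
# trainer = A3CTrainer(config=config)

for i in range(5):
    result = trainer.train()
    print(i, result["episode_reward_mean"])
```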

## Documentation & Implementation:

1) A2C.

**[Detailed Documentation](https://docs.ray.io/en/master/rllib-algorithms.html#a3c)**

**[Implementation](https://github.com/ray-project/ray/blob/master/rllib/agents/a3c/a2c.py)**

2) A3C.

**[Detailed Documentation](https://docs.ray.io/en/master/rllib-algorithms.html#a3c)**

**[Implementation](https://github.com/ray-project/ray/blob/master/rllib/agents/a3c/a3c.py)**