.. include:: /_includes/rllib/announcement.rst

.. include:: /_includes/rllib/we_are_hiring.rst

.. _rllib-algorithms-doc:

Algorithms
==========

.. tip::

    Check out the `environments <rllib-env.html>`__ page to learn more about different environment types.

Available Algorithms - Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

============================== ========== ============================== ==================== =============== ============================================================ ===============
Algorithm                      Frameworks Discrete Actions               Continuous Actions   Multi-Agent     Model Support                                                Multi-GPU
============================== ========== ============================== ==================== =============== ============================================================ ===============
`A2C`_                         tf + torch **Yes** `+parametric`_         **Yes**              **Yes**         `+RNN`_, `+LSTM auto-wrapping`_, `+Attention`_, `+autoreg`_  A2C: tf + torch
`A3C`_                         tf + torch **Yes** `+parametric`_         **Yes**              **Yes**         `+RNN`_, `+LSTM auto-wrapping`_, `+Attention`_, `+autoreg`_  No
`AlphaZero`_                   torch      **Yes** `+parametric`_         No                   No                                                                           No
`APPO`_                        tf + torch **Yes** `+parametric`_         **Yes**              **Yes**         `+RNN`_, `+LSTM auto-wrapping`_, `+Attention`_, `+autoreg`_  tf + torch
`ARS`_                         tf + torch **Yes**                        **Yes**              No                                                                           No
`Bandits`_ (`TS`_ & `LinUCB`_) torch      **Yes** `+parametric`_         No                   **Yes**                                                                      No
`BC`_                          tf + torch **Yes** `+parametric`_         **Yes**              **Yes**         `+RNN`_                                                      torch
`CQL`_                         tf + torch No                             **Yes**              No                                                                           tf + torch
`CRR`_                         torch      **Yes** `+parametric`_         **Yes**              **Yes**                                                                      torch
`DDPG`_                        tf + torch No                             **Yes**              **Yes**                                                                      torch
`APEX-DDPG`_                   tf + torch No                             **Yes**              **Yes**                                                                      torch
`ES`_                          tf + torch **Yes**                        **Yes**              No                                                                           No
`Dreamer`_                     torch      No                             **Yes**              No              `+RNN`_                                                      torch
`DQN`_, `Rainbow`_             tf + torch **Yes** `+parametric`_         No                   **Yes**                                                                      tf + torch
`APEX-DQN`_                    tf + torch **Yes** `+parametric`_         No                   **Yes**                                                                      torch
`IMPALA`_                      tf + torch **Yes** `+parametric`_         **Yes**              **Yes**         `+RNN`_, `+LSTM auto-wrapping`_, `+Attention`_, `+autoreg`_  tf + torch
`LeelaChessZero`_              torch      **Yes** `+parametric`_         No                   **Yes**                                                                      torch
`MAML`_                        tf + torch No                             **Yes**              No                                                                           torch
`MARWIL`_                      tf + torch **Yes** `+parametric`_         **Yes**              **Yes**         `+RNN`_                                                      torch
`MBMPO`_                       torch      No                             **Yes**              No                                                                           torch
`PG`_                          tf + torch **Yes** `+parametric`_         **Yes**              **Yes**         `+RNN`_, `+LSTM auto-wrapping`_, `+Attention`_, `+autoreg`_  tf + torch
`PPO`_                         tf + torch **Yes** `+parametric`_         **Yes**              **Yes**         `+RNN`_, `+LSTM auto-wrapping`_, `+Attention`_, `+autoreg`_  tf + torch
`R2D2`_                        tf + torch **Yes** `+parametric`_         No                   **Yes**         `+RNN`_, `+LSTM auto-wrapping`_, `+autoreg`_                 torch
`SAC`_                         tf + torch **Yes**                        **Yes**              **Yes**                                                                      torch
`SlateQ`_                      tf + torch **Yes** (multi-discr. slates)  No                   No                                                                           torch
`TD3`_                         tf + torch No                             **Yes**              **Yes**                                                                      torch
============================== ========== ============================== ==================== =============== ============================================================ ===============

Multi-Agent only Methods

================================ ========== ======================= ================== =========== =====================
Algorithm                        Frameworks Discrete Actions        Continuous Actions Multi-Agent Model Support
================================ ========== ======================= ================== =========== =====================
`QMIX`_                          torch      **Yes** `+parametric`_  No                 **Yes**     `+RNN`_
`MADDPG`_                        tf         **Yes**                 Partial            **Yes**
`Parameter Sharing`_             Depends on bootstrapped algorithm
-------------------------------- ---------------------------------------------------------------------------------------
`Fully Independent Learning`_    Depends on bootstrapped algorithm
-------------------------------- ---------------------------------------------------------------------------------------
`Shared Critic Methods`_         Depends on bootstrapped algorithm
================================ =======================================================================================

Exploration-based plug-ins (can be combined with any algo)

================================ ========== ======================= ================== =========== =====================
Algorithm                        Frameworks Discrete Actions        Continuous Actions Multi-Agent Model Support
================================ ========== ======================= ================== =========== =====================
`Curiosity`_                     tf + torch **Yes** `+parametric`_  No                 **Yes**     `+RNN`_
================================ ========== ======================= ================== =========== =====================

.. _`APEX-DQN`: rllib-algorithms.html#apex
.. _`APEX-DDPG`: rllib-algorithms.html#apex
.. _`+autoreg`: rllib-models.html#autoregressive-action-distributions
.. _`+LSTM auto-wrapping`: rllib-models.html#built-in-models
.. _`+parametric`: rllib-models.html#variable-length-parametric-action-spaces
.. _`Rainbow`: rllib-algorithms.html#dqn
.. _`+RNN`: rllib-models.html#rnns
.. _`+Attention`: rllib-models.html#attention
.. _`TS`: rllib-algorithms.html#lints
.. _`LinUCB`: rllib-algorithms.html#lin-ucb

Offline
~~~~~~~

.. _bc:

Behavior Cloning (BC; derived from MARWIL implementation)
---------------------------------------------------------
|pytorch| |tensorflow|
`[paper] <http://papers.nips.cc/paper/7866-exponentially-weighted-imitation-learning-for-batched-historical-data>`__
`[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/bc/bc.py>`__

Our behavioral cloning implementation is directly derived from our `MARWIL`_ implementation,
with the only difference being the ``beta`` parameter force-set to 0.0. This makes
BC try to match the behavior policy, which generated the offline data, disregarding any resulting rewards.
BC requires the `offline datasets API <rllib-offline.html>`__ to be used.

Tuned examples: `CartPole-v1 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/bc/cartpole-bc.yaml>`__

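For illustration, a minimal BC setup could look roughly like the sketch below. The dataset path is a placeholder for JSON files previously written via the offline API; the config methods shown (``environment``, ``offline_data``, ``training``, ``build``) follow the ``AlgorithmConfig`` pattern used throughout this page.

.. code-block:: python

    from ray.rllib.algorithms.bc import BCConfig

    # Minimal sketch: behavior-clone from previously recorded offline data.
    # "/tmp/cartpole-out" is a placeholder path to SampleBatch JSON files.
    config = (
        BCConfig()
        .environment("CartPole-v1")
        .offline_data(input_="/tmp/cartpole-out")
        .training(lr=0.0001)
    )
    algo = config.build()
    for _ in range(5):
        print(algo.train())
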
**BC-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

.. autoclass:: ray.rllib.algorithms.bc.bc.BCConfig
    :members: training

.. _crr:

Critic Regularized Regression (CRR)
-----------------------------------
|pytorch|
`[paper] <https://arxiv.org/abs/2006.15134>`__ `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/crr/crr.py>`__

CRR is another offline RL algorithm based on Q-learning that can learn from an offline experience replay.
The challenge in applying existing Q-learning algorithms to offline RL lies in the overestimation of the Q-function, as well as the lack of exploration beyond the observed data.
The latter becomes increasingly important during bootstrapping in the Bellman equation, where the Q-function queried for the next state's Q-value(s) does not have support in the observed data.
To mitigate these issues, CRR implements a simple yet powerful idea of "value-filtered regression".
The key idea is to use a learned critic to filter out non-promising transitions from the replay dataset. For more details, please refer to the paper (see link above).

Tuned examples: `CartPole-v1 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/crr/cartpole-v1-crr.yaml>`__, `Pendulum-v1 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/crr/pendulum-v1-crr.yaml>`__

.. autoclass:: ray.rllib.algorithms.crr.crr.CRRConfig
    :members: training

.. _cql:

Conservative Q-Learning (CQL)
-----------------------------
|pytorch| |tensorflow|
`[paper] <https://arxiv.org/abs/2006.04779>`__ `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/cql/cql.py>`__

In offline RL, the algorithm has no access to an environment, but can only sample from a fixed dataset of pre-collected state-action-reward tuples.
In particular, CQL (Conservative Q-Learning) is an offline RL algorithm that mitigates the overestimation of Q-values outside the dataset distribution via
conservative critic estimates. It does so by adding a simple Q regularizer loss to the standard Bellman update loss.
This ensures that the critic does not output overly-optimistic Q-values. This conservative
correction term can be added on top of any off-policy Q-learning algorithm (here, we provide this for SAC).

RLlib's CQL is evaluated against the Behavior Cloning (BC) benchmark at 500K gradient steps over the dataset. The only difference between the BC and CQL configs is the ``bc_iters`` parameter in CQL, indicating how many gradient steps we perform over the BC loss. CQL is evaluated on the `D4RL <https://github.com/rail-berkeley/d4rl>`__ benchmark, which has pre-collected offline datasets for many types of environments.

Tuned examples: `HalfCheetah Random <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/cql/halfcheetah-cql.yaml>`__, `Hopper Random <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/cql/hopper-cql.yaml>`__

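A rough sketch of a CQL run is shown below; the dataset path is a placeholder, and ``bc_iters`` is the parameter mentioned above that controls how many initial gradient steps use the pure BC loss.

.. code-block:: python

    from ray.rllib.algorithms.cql import CQLConfig

    # Minimal sketch: CQL on a pre-collected dataset (placeholder path).
    config = (
        CQLConfig()
        .environment("Pendulum-v1")
        .offline_data(input_="/tmp/pendulum-out")
        .training(bc_iters=20000)
    )
    algo = config.build()
    algo.train()
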
**CQL-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

.. autoclass:: ray.rllib.algorithms.cql.cql.CQLConfig
    :members: training

.. _marwil:

Monotonic Advantage Re-Weighted Imitation Learning (MARWIL)
-----------------------------------------------------------
|pytorch| |tensorflow|
`[paper] <http://papers.nips.cc/paper/7866-exponentially-weighted-imitation-learning-for-batched-historical-data>`__
`[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/marwil/marwil.py>`__

MARWIL is a hybrid imitation learning and policy gradient algorithm suitable for training on batched historical data.
When the ``beta`` hyperparameter is set to zero, the MARWIL objective reduces to vanilla imitation learning (see `BC`_).
MARWIL requires the `offline datasets API <rllib-offline.html>`__ to be used.

Tuned examples: `CartPole-v1 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/marwil/cartpole-marwil.yaml>`__

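The role of ``beta`` can be made concrete with a small sketch (dataset path is a placeholder): with ``beta > 0``, advantages re-weight the imitation objective; with ``beta = 0``, MARWIL reduces to the BC behavior described above.

.. code-block:: python

    from ray.rllib.algorithms.marwil import MARWILConfig

    # Minimal sketch: MARWIL on recorded offline data (placeholder path).
    # beta=1.0 enables advantage re-weighting; beta=0.0 is plain behavior cloning.
    config = (
        MARWILConfig()
        .environment("CartPole-v1")
        .offline_data(input_="/tmp/cartpole-out")
        .training(beta=1.0)
    )
    algo = config.build()
    algo.train()
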
**MARWIL-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

.. autoclass:: ray.rllib.algorithms.marwil.marwil.MARWILConfig
    :members: training

Model-free On-policy RL
~~~~~~~~~~~~~~~~~~~~~~~

.. _appo:

Asynchronous Proximal Policy Optimization (APPO)
------------------------------------------------
|pytorch| |tensorflow|
`[paper] <https://arxiv.org/abs/1707.06347>`__
`[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/appo/appo.py>`__

We include an asynchronous variant of Proximal Policy Optimization (PPO) based on the IMPALA architecture. It is similar to IMPALA, but uses a surrogate policy loss with clipping. Compared to synchronous PPO, APPO is more efficient in wall-clock time due to its use of asynchronous sampling. Using a clipped loss also allows for multiple SGD passes, and therefore the potential for better sample efficiency compared to IMPALA. V-trace can also be enabled to correct for off-policy samples.

.. tip::

    APPO is not always more efficient; it is often better to use :ref:`standard PPO <ppo>` or :ref:`IMPALA <impala>`.

.. figure:: images/impala-arch.svg

    APPO architecture (same as IMPALA)

Tuned examples: `PongNoFrameskip-v4 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/appo/pong-appo.yaml>`__

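A rough configuration sketch is shown below; it assumes the APPO ``vtrace`` training flag for the off-policy correction mentioned above, plus a few asynchronous sampling workers.

.. code-block:: python

    from ray.rllib.algorithms.appo import APPOConfig

    # Minimal sketch: asynchronous PPO with V-trace correction enabled.
    config = (
        APPOConfig()
        .environment("CartPole-v1")
        .rollouts(num_rollout_workers=4)
        .training(vtrace=True, lr=0.0005, train_batch_size=1000)
    )
    algo = config.build()
    algo.train()
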
**APPO-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

.. autoclass:: ray.rllib.algorithms.appo.appo.APPOConfig
    :members: training

.. _ddppo:

Decentralized Distributed Proximal Policy Optimization (DD-PPO)
---------------------------------------------------------------
|pytorch|
`[paper] <https://arxiv.org/abs/1911.00357>`__
`[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/ddppo/ddppo.py>`__

Unlike APPO or PPO, with DD-PPO policy improvement is no longer done centrally in the algorithm process. Instead, gradients are computed remotely on each rollout worker and all-reduced at each mini-batch using `torch distributed <https://pytorch.org/docs/stable/distributed.html>`__. This allows each worker's GPU to be used both for sampling and for training.

.. tip::

    DD-PPO is best for envs that require GPUs to function, or if you need to scale out SGD to multiple nodes. If you don't meet these requirements, `standard PPO <#proximal-policy-optimization-ppo>`__ will be more efficient.

.. figure:: images/ddppo-arch.svg

    DD-PPO architecture (both sampling and learning are done on worker GPUs)

Tuned examples: `CartPole-v1 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/ddppo/cartpole-ddppo.yaml>`__, `BreakoutNoFrameskip-v4 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/ddppo/atari-ddppo.yaml>`__

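A minimal sketch of the decentralized setup follows: each rollout worker is given a GPU and both samples and trains locally, with gradients all-reduced across workers. The exact resource numbers are illustrative only.

.. code-block:: python

    from ray.rllib.algorithms.ddppo import DDPPOConfig

    # Minimal sketch: 4 workers, each with its own GPU; gradients are
    # all-reduced across workers via torch.distributed.
    config = (
        DDPPOConfig()
        .environment("CartPole-v1")
        .rollouts(num_rollout_workers=4)
        .resources(num_gpus_per_worker=1)
    )
    algo = config.build()
    algo.train()
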
**DDPPO-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

.. autoclass:: ray.rllib.algorithms.ddppo.ddppo.DDPPOConfig
    :members: training

.. _ppo:

Proximal Policy Optimization (PPO)
----------------------------------
|pytorch| |tensorflow|
`[paper] <https://arxiv.org/abs/1707.06347>`__
`[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/ppo/ppo.py>`__

PPO's clipped objective supports multiple SGD passes over the same batch of experiences. RLlib's multi-GPU optimizer pins that data in GPU memory to avoid unnecessary transfers from host memory, substantially improving performance over a naive implementation. PPO scales out using multiple workers for experience collection, and also to multiple GPUs for SGD.

.. tip::

    If you need to scale out with GPUs on multiple nodes, consider using `decentralized PPO <#decentralized-distributed-proximal-policy-optimization-dd-ppo>`__.

.. figure:: images/ppo-arch.svg

    PPO architecture

Tuned examples:
`Unity3D Soccer (multi-agent: Strikers vs Goalie) <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/ppo/unity3d-soccer-strikers-vs-goalie-ppo.yaml>`__,
`Humanoid-v1 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/ppo/humanoid-ppo-gae.yaml>`__,
`Hopper-v1 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/ppo/hopper-ppo.yaml>`__,
`Pendulum-v1 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/ppo/pendulum-ppo.yaml>`__,
`PongDeterministic-v4 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/ppo/pong-ppo.yaml>`__,
`Walker2d-v1 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/ppo/walker2d-ppo.yaml>`__,
`HalfCheetah-v2 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/ppo/halfcheetah-ppo.yaml>`__,
`{BeamRider,Breakout,Qbert,SpaceInvaders}NoFrameskip-v4 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/ppo/atari-ppo.yaml>`__

**Atari results**: `more details <https://github.com/ray-project/rl-experiments>`__

============= ============== ============== ==================
Atari env     RLlib PPO @10M RLlib PPO @25M Baselines PPO @10M
============= ============== ============== ==================
BeamRider     2807           4480           ~1800
Breakout      104            201            ~250
Qbert         11085          14247          ~14000
SpaceInvaders 671            944            ~800
============= ============== ============== ==================

**Scalability:** `more details <https://github.com/ray-project/rl-experiments>`__

============= ========================= =============================
MuJoCo env    RLlib PPO 16-workers @ 1h Fan et al PPO 16-workers @ 1h
============= ========================= =============================
HalfCheetah   9664                      ~7700
============= ========================= =============================

.. figure:: images/ppo.png
    :width: 500px

    RLlib's multi-GPU PPO scales to multiple GPUs and hundreds of CPUs on solving the Humanoid-v1 task. Here we compare against a reference MPI-based implementation.

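The multiple-SGD-passes behavior described above maps onto a handful of config parameters. The following is a minimal sketch with illustrative values (set ``num_gpus=0`` if no GPU is available):

.. code-block:: python

    from ray.rllib.algorithms.ppo import PPOConfig

    # Minimal sketch: several SGD epochs over each sampled train batch,
    # sampling scaled out to workers and SGD placed on one GPU.
    config = (
        PPOConfig()
        .environment("CartPole-v1")
        .rollouts(num_rollout_workers=4)
        .resources(num_gpus=1)
        .training(
            train_batch_size=4000,
            sgd_minibatch_size=128,
            num_sgd_iter=30,  # number of SGD passes over each train batch
        )
    )
    algo = config.build()
    algo.train()
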
**PPO-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

.. autoclass:: ray.rllib.algorithms.ppo.ppo.PPOConfig
    :members: training

.. _impala:

Importance Weighted Actor-Learner Architecture (IMPALA)
-------------------------------------------------------
|pytorch| |tensorflow|
`[paper] <https://arxiv.org/abs/1802.01561>`__
`[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/impala/impala.py>`__

In IMPALA, a central learner runs SGD in a tight loop while asynchronously pulling sample batches from many actor processes. RLlib's IMPALA implementation uses DeepMind's reference `V-trace code <https://github.com/deepmind/scalable_agent/blob/master/vtrace.py>`__. Note that we do not provide a deep residual network out of the box, but one can be plugged in as a `custom model <rllib-models.html#custom-models-tensorflow>`__. Multiple learner GPUs and experience replay are also supported.

.. figure:: images/impala-arch.svg

    IMPALA architecture

Tuned examples: `PongNoFrameskip-v4 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/impala/pong-impala.yaml>`__, `vectorized configuration <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/impala/pong-impala-vectorized.yaml>`__, `multi-gpu configuration <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/impala/pong-impala-fast.yaml>`__, `{BeamRider,Breakout,Qbert,SpaceInvaders}NoFrameskip-v4 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/impala/atari-impala.yaml>`__

**Atari results @10M steps**: `more details <https://github.com/ray-project/rl-experiments>`__

============= ======================= =========================
Atari env     RLlib IMPALA 32-workers Mnih et al A3C 16-workers
============= ======================= =========================
BeamRider     2071                    ~3000
Breakout      385                     ~150
Qbert         4068                    ~1000
SpaceInvaders 719                     ~600
============= ======================= =========================

**Scalability:**

============= =============================== =================================
Atari env     RLlib IMPALA 32-workers @1 hour Mnih et al A3C 16-workers @1 hour
============= =============================== =================================
BeamRider     3181                            ~1000
Breakout      538                             ~10
Qbert         10850                           ~500
SpaceInvaders 843                             ~300
============= =============================== =================================

.. figure:: images/impala.png

    Multi-GPU IMPALA scales up to solve PongNoFrameskip-v4 in ~3 minutes using a pair of V100 GPUs and 128 CPU workers.
    The maximum training throughput reached is ~30k transitions per second (~120k environment frames per second).

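The asynchronous learner/actor split translates into a short config sketch like the one below (values are illustrative; set ``num_gpus=0`` for CPU-only runs):

.. code-block:: python

    from ray.rllib.algorithms.impala import ImpalaConfig

    # Minimal sketch: one learner GPU, several CPU actors sending
    # sample batches asynchronously to the learner.
    config = (
        ImpalaConfig()
        .environment("CartPole-v1")
        .rollouts(num_rollout_workers=8)
        .resources(num_gpus=1)
        .training(lr=0.0005, train_batch_size=500)
    )
    algo = config.build()
    algo.train()
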
**IMPALA-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

.. autoclass:: ray.rllib.algorithms.impala.impala.ImpalaConfig
    :members: training

.. _a2c:

Advantage Actor-Critic (A2C)
----------------------------
|pytorch| |tensorflow|
`[paper] <https://arxiv.org/abs/1602.01783>`__ `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/a2c/a2c.py>`__

A2C scales to 16-32+ worker processes depending on the environment and supports microbatching
(i.e., gradient accumulation), which can be enabled by setting the ``microbatch_size`` config.
Microbatching allows for training with a ``train_batch_size`` much larger than GPU memory.

.. figure:: images/a2c-arch.svg

    A2C architecture

Tuned examples: `Atari environments <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/a2c/atari-a2c.yaml>`__

.. tip::

    Consider using `IMPALA <#importance-weighted-actor-learner-architecture-impala>`__ for faster training with similar timestep efficiency.

**Atari results @10M steps**: `more details <https://github.com/ray-project/rl-experiments>`__

============= =================== =========================
Atari env     RLlib A2C 5-workers Mnih et al A3C 16-workers
============= =================== =========================
BeamRider     1401                ~3000
Breakout      374                 ~150
Qbert         3620                ~1000
SpaceInvaders 692                 ~600
============= =================== =========================

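The microbatching option described above can be sketched roughly as follows (the batch sizes are illustrative; gradients are accumulated over micro-batches until a full train batch has been processed):

.. code-block:: python

    from ray.rllib.algorithms.a2c import A2CConfig

    # Minimal sketch: accumulate gradients over 100-timestep micro-batches
    # until the 1000-timestep train batch has been consumed.
    config = (
        A2CConfig()
        .environment("CartPole-v1")
        .rollouts(num_rollout_workers=4)
        .training(train_batch_size=1000, microbatch_size=100)
    )
    algo = config.build()
    algo.train()
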
**A2C-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

.. autoclass:: ray.rllib.algorithms.a2c.a2c.A2CConfig
    :members: training

.. _a3c:

Asynchronous Advantage Actor-Critic (A3C)
-----------------------------------------
|pytorch| |tensorflow|
`[paper] <https://arxiv.org/abs/1602.01783>`__ `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/a3c/a3c.py>`__

A3C is the asynchronous version of A2C, where gradients are computed on the workers directly after trajectory rollouts,
and only then shipped to a central learner that accumulates these gradients on the central model. After the central model update, parameters are broadcast back to
all workers.
Similar to A2C, A3C scales to 16-32+ worker processes depending on the environment.

Tuned examples: `PongDeterministic-v4 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/a3c/pong-a3c.yaml>`__

.. tip::

    Consider using `IMPALA <#importance-weighted-actor-learner-architecture-impala>`__ for faster training with similar timestep efficiency.

**A3C-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

.. autoclass:: ray.rllib.algorithms.a3c.a3c.A3CConfig
    :members: training

.. _pg:

Policy Gradients (PG)
---------------------
|pytorch| |tensorflow|
`[paper] <https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf>`__
`[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/pg/pg.py>`__

We include a vanilla policy gradients implementation as an example algorithm.

.. figure:: images/a2c-arch.svg

    Policy gradients architecture (same as A2C)

Tuned examples: `CartPole-v1 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/pg/cartpole-pg.yaml>`__

**PG-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

.. autoclass:: ray.rllib.algorithms.pg.pg.PGConfig
    :members: training

.. _maml:

Model-Agnostic Meta-Learning (MAML)
-----------------------------------
|pytorch| |tensorflow|
`[paper] <https://arxiv.org/abs/1703.03400>`__ `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/maml/maml.py>`__

RLlib's MAML implementation is a meta-learning method for learning and quick adaptation across different tasks for continuous control. The code here is adapted from https://github.com/jonasrothfuss; this version outperforms vanilla MAML and avoids computing higher-order gradients during the meta-update step. MAML is evaluated on custom environments that are described in greater detail `here <https://github.com/ray-project/ray/blob/master/rllib/env/apis/task_settable_env.py>`__.

MAML uses additional metrics to measure performance: ``episode_reward_mean`` measures the agent's returns before adaptation, ``episode_reward_mean_adapt_N`` measures the agent's returns after N gradient steps of inner adaptation, and ``adaptation_delta`` measures the difference in performance before and after adaptation. Examples can be seen `here <https://github.com/ray-project/rl-experiments/tree/master/maml>`__.

Tuned examples: HalfCheetahRandDirecEnv (`Env <https://github.com/ray-project/ray/blob/master/rllib/examples/env/halfcheetah_rand_direc.py>`__, `Config <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/maml/halfcheetah-rand-direc-maml.yaml>`__), AntRandGoalEnv (`Env <https://github.com/ray-project/ray/blob/master/rllib/examples/env/ant_rand_goal.py>`__, `Config <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/maml/ant-rand-goal-maml.yaml>`__), PendulumMassEnv (`Env <https://github.com/ray-project/ray/blob/master/rllib/examples/env/pendulum_mass.py>`__, `Config <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/maml/pendulum-mass-maml.yaml>`__)

**MAML-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

.. autoclass:: ray.rllib.algorithms.maml.maml.MAMLConfig
    :members: training

Model-free Off-policy RL
~~~~~~~~~~~~~~~~~~~~~~~~

.. _apex:

Distributed Prioritized Experience Replay (Ape-X)
-------------------------------------------------
|pytorch| |tensorflow|
`[paper] <https://arxiv.org/abs/1803.00933>`__
`[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/apex_dqn/apex_dqn.py>`__

Ape-X variations of DQN and DDPG (`APEX_DQN <https://github.com/ray-project/ray/blob/master/rllib/algorithms/apex_dqn/apex_dqn.py>`__, `APEX_DDPG <https://github.com/ray-project/ray/blob/master/rllib/algorithms/apex_ddpg/apex_ddpg.py>`__) use a single GPU learner and many CPU workers for experience collection. Experience collection can scale to hundreds of CPU workers due to the distributed prioritization of experience prior to storage in replay buffers.

.. figure:: images/apex-arch.svg

    Ape-X architecture

Tuned examples: `PongNoFrameskip-v4 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/apex_dqn/pong-apex-dqn.yaml>`__, `Pendulum-v1 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/apex_ddpg/pendulum-apex-ddpg.yaml>`__, `MountainCarContinuous-v0 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/apex_ddpg/mountaincarcontinuous-apex-ddpg.yaml>`__, `{BeamRider,Breakout,Qbert,SpaceInvaders}NoFrameskip-v4 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/apex_dqn/atari-apex-dqn.yaml>`__.

**Atari results @10M steps**: `more details <https://github.com/ray-project/rl-experiments>`__

============= ===================== ===============================
Atari env     RLlib Ape-X 8-workers Mnih et al Async DQN 16-workers
============= ===================== ===============================
BeamRider     6134                  ~6000
Breakout      123                   ~50
Qbert         15302                 ~1200
SpaceInvaders 686                   ~600
============= ===================== ===============================

**Scalability**:

============= ============================= =======================================
Atari env     RLlib Ape-X 8-workers @1 hour Mnih et al Async DQN 16-workers @1 hour
============= ============================= =======================================
BeamRider     4873                          ~1000
Breakout      77                            ~10
Qbert         4083                          ~500
SpaceInvaders 646                           ~300
============= ============================= =======================================

.. figure:: images/apex.png

    Ape-X using 32 workers in RLlib vs vanilla DQN (orange) and A3C (blue) on PongNoFrameskip-v4.

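The learner/collector split described above can be configured along these lines (a minimal sketch; worker and GPU counts are illustrative):

.. code-block:: python

    from ray.rllib.algorithms.apex_dqn import ApexDQNConfig

    # Minimal sketch: one GPU learner plus many CPU workers that collect and
    # prioritize experience before it enters the replay buffers.
    config = (
        ApexDQNConfig()
        .environment("CartPole-v1")
        .rollouts(num_rollout_workers=8)
        .resources(num_gpus=1)  # set to 0 for CPU-only runs
    )
    algo = config.build()
    algo.train()
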
**Ape-X specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

.. autoclass:: ray.rllib.algorithms.apex_dqn.apex_dqn.ApexDQNConfig
    :members: training

.. _r2d2:

Recurrent Replay Distributed DQN (R2D2)
---------------------------------------
|pytorch| |tensorflow|
`[paper] <https://openreview.net/pdf?id=r1lyTjAqYX>`__ `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/r2d2/r2d2.py>`__

R2D2 can be scaled by increasing the number of workers. All of the DQN improvements evaluated in `Rainbow <https://arxiv.org/abs/1710.02298>`__ are available, though not all are enabled by default.

Tuned examples: `Stateless CartPole-v1 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/r2d2/stateless-cartpole-r2d2.yaml>`__

.. _dqn:

Deep Q Networks (DQN, Rainbow, Parametric DQN)
----------------------------------------------
|pytorch| |tensorflow|
`[paper] <https://arxiv.org/abs/1312.5602>`__ `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/dqn/dqn.py>`__

DQN can be scaled by increasing the number of workers or using Ape-X. Memory usage is reduced by compressing samples in the replay buffer with LZ4. All of the DQN improvements evaluated in `Rainbow <https://arxiv.org/abs/1710.02298>`__ are available, though not all are enabled by default. See also how to use `parametric-actions in DQN <rllib-models.html#variable-length-parametric-action-spaces>`__.

.. figure:: images/dqn-arch.svg

    DQN architecture

Tuned examples: `PongDeterministic-v4 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/dqn/pong-dqn.yaml>`__, `Rainbow configuration <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/dqn/pong-rainbow.yaml>`__, `{BeamRider,Breakout,Qbert,SpaceInvaders}NoFrameskip-v4 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/dqn/atari-dqn.yaml>`__, `with Dueling and Double-Q <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/dqn/atari-duel-ddqn.yaml>`__, `with Distributional DQN <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/dqn/atari-dist-dqn.yaml>`__.

.. tip::

    Consider using `Ape-X <#distributed-prioritized-experience-replay-ape-x>`__ for faster training with similar timestep efficiency.

.. hint::

    For a complete `rainbow <https://arxiv.org/pdf/1710.02298.pdf>`__ setup,
    make the following changes to the default DQN config:
    ``"n_step": [between 1 and 10], "noisy": True, "num_atoms": [more than 1],
    "v_min": -10.0, "v_max": 10.0``
    (set ``v_min`` and ``v_max`` according to your expected range of returns).

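The hint above translates into config code roughly as follows; the concrete values (3-step targets, 51 atoms, the return range) are illustrative choices, not requirements:

.. code-block:: python

    from ray.rllib.algorithms.dqn import DQNConfig

    # Minimal Rainbow-style sketch: n-step targets, noisy nets, and a
    # distributional Q-head; tune v_min/v_max to your expected return range.
    config = (
        DQNConfig()
        .environment("CartPole-v1")
        .training(
            n_step=3,
            noisy=True,
            num_atoms=51,
            v_min=-10.0,
            v_max=10.0,
        )
    )
    algo = config.build()
    algo.train()
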
**Atari results @10M steps**: `more details <https://github.com/ray-project/rl-experiments>`__

============= ========= ================== =============== =================
Atari env     RLlib DQN RLlib Dueling DDQN RLlib Dist. DQN Hessel et al. DQN
============= ========= ================== =============== =================
BeamRider     2869      1910               4447            ~2000
Breakout      287       312                410             ~150
Qbert         3921      7968               15780           ~4000
SpaceInvaders 650       1001               1025            ~500
============= ========= ================== =============== =================

**DQN-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

.. autoclass:: ray.rllib.algorithms.dqn.dqn.DQNConfig
    :members: training

.. _ddpg:

Deep Deterministic Policy Gradients (DDPG)
------------------------------------------
|pytorch| |tensorflow|
`[paper] <https://arxiv.org/abs/1509.02971>`__
`[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/ddpg/ddpg.py>`__

DDPG is implemented similarly to DQN (see above). The algorithm can be scaled by increasing the number of workers or using Ape-X.
The improvements from `TD3 <https://spinningup.openai.com/en/latest/algorithms/td3.html>`__ are available through the separate ``TD3`` algorithm.

.. figure:: images/dqn-arch.svg

    DDPG architecture (same as DQN)

Tuned examples: `Pendulum-v1 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/ddpg/pendulum-ddpg.yaml>`__, `MountainCarContinuous-v0 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/ddpg/mountaincarcontinuous-ddpg.yaml>`__, `HalfCheetah-v2 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/ddpg/halfcheetah-ddpg.yaml>`__.

**DDPG-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

.. autoclass:: ray.rllib.algorithms.ddpg.ddpg.DDPGConfig
    :members: training

.. _td3:

Twin Delayed DDPG (TD3)
-----------------------
|pytorch| |tensorflow|
`[paper] <https://arxiv.org/abs/1802.09477>`__
`[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/td3/td3.py>`__

TD3 improves on DDPG by adding twin Q-networks, delayed policy updates, and target policy smoothing (see the `Spinning Up overview <https://spinningup.openai.com/en/latest/algorithms/td3.html>`__). RLlib provides it as the standalone ``TD3`` algorithm.

Tuned examples: `TD3 Pendulum-v1 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/td3/pendulum-td3.yaml>`__, `TD3 InvertedPendulum-v2 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/td3/invertedpendulum-td3.yaml>`__, `TD3 Mujoco suite (Ant-v2, HalfCheetah-v2, Hopper-v2, Walker2d-v2) <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/td3/mujoco-td3.yaml>`__.

**TD3-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

.. autoclass:: ray.rllib.algorithms.td3.td3.TD3Config
    :members: training

.. _sac:

Soft Actor Critic (SAC)
------------------------
|pytorch| |tensorflow|
`[original paper] <https://arxiv.org/pdf/1801.01290>`__, `[follow up paper] <https://arxiv.org/pdf/1812.05905.pdf>`__, `[discrete actions paper] <https://arxiv.org/pdf/1910.07207v2.pdf>`__
`[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/sac/sac.py>`__

.. figure:: images/dqn-arch.svg

    SAC architecture (same as DQN)

RLlib's Soft Actor-Critic implementation is ported from the `official SAC repo <https://github.com/rail-berkeley/softlearning>`__ to better integrate with RLlib APIs.
Note that SAC has two fields to configure for custom models: ``policy_model_config`` and ``q_model_config``; the ``model`` field of the config will be ignored.

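For example, the two model fields could be set roughly as in the sketch below; the dictionaries follow the standard model-config format (``fcnet_hiddens`` etc.), and the layer sizes are illustrative only.

.. code-block:: python

    from ray.rllib.algorithms.sac import SACConfig

    # Minimal sketch: SAC takes separate model configs for the policy- and
    # Q-networks; the generic `model` field is ignored.
    config = (
        SACConfig()
        .environment("Pendulum-v1")
        .training(
            policy_model_config={"fcnet_hiddens": [256, 256]},
            q_model_config={"fcnet_hiddens": [256, 256]},
        )
    )
    algo = config.build()
    algo.train()
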
Tuned examples (continuous actions):
`Pendulum-v1 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/sac/pendulum-sac.yaml>`__,
`HalfCheetah-v3 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/sac/halfcheetah-sac.yaml>`__.
Tuned examples (discrete actions):
`CartPole-v1 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/sac/cartpole-sac.yaml>`__

**MuJoCo results @3M steps:** `more details <https://github.com/ray-project/rl-experiments>`__

=========== ========= ==================
MuJoCo env  RLlib SAC Haarnoja et al SAC
=========== ========= ==================
HalfCheetah 13000     ~15000
=========== ========= ==================

**SAC-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

.. autoclass:: ray.rllib.algorithms.sac.sac.SACConfig
    :members: training

Model-based RL
~~~~~~~~~~~~~~

.. _dreamer:

Dreamer
-------
|pytorch|
`[paper] <https://arxiv.org/abs/1912.01603>`__ `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/dreamer/dreamer.py>`__

Dreamer is an image-only model-based RL method that learns by imagining trajectories in the future and is evaluated on the DeepMind Control Suite `environments <https://github.com/ray-project/ray/blob/master/rllib/examples/env/dm_control_suite.py>`__. RLlib's Dreamer is adapted from the `official Google research repo <https://github.com/google-research/dreamer>`__.

To visualize learning, RLlib Dreamer's imagined trajectories are logged as gifs in TensorBoard. Examples of these visualizations can be seen `here <https://github.com/ray-project/rl-experiments>`__.

Tuned examples: `Deepmind Control Environments <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/dreamer/dreamer-deepmind-control.yaml>`__

**Deepmind Control results @1M steps:** `more details <https://github.com/ray-project/rl-experiments>`__

=========== ============= =====================
DMC env     RLlib Dreamer Danijar et al Dreamer
=========== ============= =====================
Walker-Walk 920           ~930
Cheetah-Run 640           ~800
=========== ============= =====================

**Dreamer-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

.. autoclass:: ray.rllib.algorithms.dreamer.dreamer.DreamerConfig
    :members: training

.. _mbmpo:

Model-Based Meta-Policy-Optimization (MB-MPO)
---------------------------------------------
|pytorch|
`[paper] <https://arxiv.org/pdf/1809.05214.pdf>`__ `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/mbmpo/mbmpo.py>`__

RLlib's MBMPO implementation is a Dyna-style model-based RL method that learns based on the predictions of an ensemble of transition-dynamics models. Similar to MAML, MBMPO meta-learns an optimal policy by treating each dynamics model as a different task. The code here is adapted from https://github.com/jonasrothfuss/model_ensemble_meta_learning. As in the original paper, MBMPO is evaluated on MuJoCo, with the horizon set to 200 instead of the default 1000.

Additional statistics are logged in MBMPO. Each MBMPO iteration corresponds to multiple MAML iterations, and ``MAMLIter$i$_DynaTrajInner_$j$_episode_reward_mean`` measures the agent's returns across the dynamics models at iteration ``i`` of MAML and step ``j`` of inner adaptation. Examples can be seen `here <https://github.com/ray-project/rl-experiments/tree/master/mbmpo>`__.

Tuned examples (continuous actions):
`Pendulum-v1 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/mbmpo/pendulum-mbmpo.yaml>`__,
`HalfCheetah <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/mbmpo/halfcheetah-mbmpo.yaml>`__,
`Hopper <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/mbmpo/hopper-mbmpo.yaml>`__.
Tuned examples (discrete actions):
`CartPole-v1 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/mbmpo/cartpole-mbmpo.yaml>`__

**MuJoCo results @100K steps:** `more details <https://github.com/ray-project/rl-experiments>`__

=========== =========== ===================
MuJoCo env  RLlib MBMPO Clavera et al MBMPO
=========== =========== ===================
HalfCheetah 520         ~550
Hopper      620         ~650
=========== =========== ===================

**MBMPO-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

.. autoclass:: ray.rllib.algorithms.mbmpo.mbmpo.MBMPOConfig
    :members: training

Derivative-free
~~~~~~~~~~~~~~~

.. _ars:

Augmented Random Search (ARS)
-----------------------------
|pytorch| |tensorflow|
`[paper] <https://arxiv.org/abs/1803.07055>`__ `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/ars/ars.py>`__

ARS is a random search method for training linear policies for continuous control problems. The code here is adapted from https://github.com/modestyachts/ARS to integrate with RLlib APIs.

Tuned examples: `CartPole-v1 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/ars/cartpole-ars.yaml>`__, `Swimmer-v2 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/ars/swimmer-ars.yaml>`__

**ARS-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

.. autoclass:: ray.rllib.algorithms.ars.ars.ARSConfig
    :members: training

.. _es:

Evolution Strategies (ES)
-------------------------
|pytorch| |tensorflow|
`[paper] <https://arxiv.org/abs/1703.03864>`__ `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/es/es.py>`__

The code here is adapted from https://github.com/openai/evolution-strategies-starter to execute in the distributed setting with Ray.

Tuned examples: `Humanoid-v1 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/es/humanoid-es.yaml>`__

**Scalability:**

.. figure:: images/es.png
    :width: 500px

    RLlib's ES implementation scales further and is faster than a reference Redis implementation on solving the Humanoid-v1 task.

**ES-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

.. autoclass:: ray.rllib.algorithms.es.es.ESConfig
    :members: training

RL for recommender systems
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. _slateq:

SlateQ
------
|pytorch|
`[paper] <https://storage.googleapis.com/pub-tools-public-publication-data/pdf/9f91de1fa0ac351ecb12e4062a37afb896aa1463.pdf>`__ `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/slateq/slateq.py>`__

SlateQ is a model-free RL method that builds on top of DQN and generates recommendation slates for recommender system environments. Because such environments come with large combinatorial action spaces, SlateQ decomposes the Q-value into single-item Q-values and solves the decomposed objective via mixing integer programming and deep learning optimization. SlateQ can be evaluated on Google's RecSim `environment <https://github.com/google-research/recsim>`__. `An RLlib wrapper for RecSim can be found here <https://github.com/ray-project/ray/blob/master/rllib/env/wrappers/recsim_wrapper.py>`__.

RecSim environment wrapper: `Google RecSim <https://github.com/ray-project/ray/blob/master/rllib/env/wrappers/recsim_wrapper.py>`__

**SlateQ-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

.. autoclass:: ray.rllib.algorithms.slateq.slateq.SlateQConfig
    :members: training

Contextual Bandits
~~~~~~~~~~~~~~~~~~

.. _bandits:

The Multi-armed bandit (MAB) problem provides a simplified RL setting that
involves learning to act under one situation only, i.e., the context (observation/state) and arms (actions/items-to-select) are both fixed.
Contextual bandit is an extension of the MAB problem, where at each
round the agent has access not only to a set of bandit arms/actions but also
to a context (state) associated with this iteration. The context changes
with each iteration, but is not affected by the action that the agent takes.
The objective of the agent is to maximize the cumulative rewards by
collecting enough information about how the context and the rewards of the
arms are related to each other. The agent does this by balancing the
trade-off between exploration and exploitation.

Contextual bandit algorithms typically consist of an action-value model (Q
model) and an exploration strategy (epsilon-greedy, LinUCB, Thompson Sampling etc.).

RLlib supports the following online contextual bandit algorithms,
named after the exploration strategies that they employ:

.. _lin-ucb:

Linear Upper Confidence Bound (BanditLinUCB)
--------------------------------------------
|pytorch|
`[paper] <http://rob.schapire.net/papers/www10.pdf>`__ `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/bandit/bandit.py>`__

LinUCB assumes a linear dependency between the expected reward of an action and
its context. It estimates the Q value of each action using ridge regression.
It constructs a confidence region around the weights of the linear
regression model and uses this confidence ellipsoid to estimate the
uncertainty of action values.

Tuned examples:
`SimpleContextualBandit <https://github.com/ray-project/ray/blob/master/rllib/algorithms/bandit/tests/test_bandits.py>`__,
`UCB Bandit on RecSim <https://github.com/ray-project/ray/blob/master/rllib/examples/bandit/tune_lin_ucb_train_recsim_env.py>`__,
`ParametricItemRecoEnv <https://github.com/ray-project/ray/blob/master/rllib/examples/bandit/tune_lin_ucb_train_recommendation.py>`__.

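A minimal sketch of running LinUCB is shown below; the environment name is a placeholder for any registered contextual-bandit env with a discrete action space (such as the ones in the tuned examples above).

.. code-block:: python

    from ray.rllib.algorithms.bandit import BanditLinUCBConfig

    # Minimal sketch: LinUCB on a registered contextual-bandit env.
    # "my_bandit_env" is a placeholder env name, not an RLlib built-in.
    config = BanditLinUCBConfig().environment("my_bandit_env")
    algo = config.build()
    algo.train()
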
**LinUCB-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

.. autoclass:: ray.rllib.algorithms.bandit.bandit.BanditLinUCBConfig
    :members: training

.. _lints:

Linear Thompson Sampling (BanditLinTS)
--------------------------------------
|pytorch|
`[paper] <http://proceedings.mlr.press/v28/agrawal13.pdf>`__
`[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/bandit/bandit.py>`__

Like LinUCB, LinTS also assumes a linear dependency between the expected
reward of an action and its context and uses online ridge regression to
estimate the Q values of actions given the context. It assumes a Gaussian
prior on the weights and a Gaussian likelihood function. For deciding which
action to take, the agent samples weights for each arm, using
the posterior distributions, and plays the arm that produces the highest reward.

Tuned examples:
`SimpleContextualBandit <https://github.com/ray-project/ray/blob/master/rllib/algorithms/bandit/tests/test_bandits.py>`__,
`WheelBandit <https://github.com/ray-project/ray/blob/master/rllib/examples/bandit/tune_lin_ts_train_wheel_env.py>`__.

**LinTS-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

.. autoclass:: ray.rllib.algorithms.bandit.bandit.BanditLinTSConfig
    :members: training

Multi-agent
~~~~~~~~~~~

.. _parameter:

Parameter Sharing
-----------------
`[paper] <http://ala2017.it.nuigalway.ie/papers/ALA2017_Gupta.pdf>`__, `[paper] <https://arxiv.org/abs/2005.13625>`__ and `[instructions] <rllib-env.html#multi-agent-and-hierarchical>`__. Parameter sharing refers to a class of methods that take a base single-agent method and use it to learn a single policy for all agents. This simple approach has been shown to achieve state-of-the-art performance in cooperative games, and it is usually the best place to start when tackling a multi-agent problem.

Tuned examples: `PettingZoo <https://github.com/PettingZoo-Team/PettingZoo>`__, `waterworld <https://github.com/ray-project/ray/blob/master/rllib/examples/multi_agent_parameter_sharing.py>`__, `rock-paper-scissors <https://github.com/ray-project/ray/blob/master/rllib/examples/rock_paper_scissors_multiagent.py>`__, `multi-agent cartpole <https://github.com/ray-project/ray/blob/master/rllib/examples/multi_agent_cartpole.py>`__

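For illustration, parameter sharing with PPO as the bootstrapped algorithm could look roughly like this (the env name is a placeholder for any registered ``MultiAgentEnv``; all agent IDs map to the one shared policy):

.. code-block:: python

    from ray.rllib.algorithms.ppo import PPOConfig

    # Minimal sketch: every agent in a multi-agent env maps to the same
    # (shared) policy, trained here with PPO as the base algorithm.
    config = (
        PPOConfig()
        .environment("my_multi_agent_env")  # placeholder: a registered MultiAgentEnv
        .multi_agent(
            policies={"shared_policy"},
            policy_mapping_fn=lambda agent_id, episode, worker, **kwargs: "shared_policy",
        )
    )
    algo = config.build()
    algo.train()
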
.. _qmix:

QMIX Monotonic Value Factorisation (QMIX, VDN, IQN)
---------------------------------------------------
|pytorch|
`[paper] <https://arxiv.org/abs/1803.11485>`__ `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/qmix/qmix.py>`__ Q-Mix is a specialized multi-agent algorithm. Code here is adapted from https://github.com/oxwhirl/pymarl_alpha to integrate with RLlib multi-agent APIs. To use Q-Mix, you must specify an agent `grouping <rllib-env.html#grouping-agents>`__ in the environment (see the `two-step game example <https://github.com/ray-project/ray/blob/master/rllib/examples/two_step_game.py>`__). Currently, all agents in the group must be homogeneous. The algorithm can be scaled by increasing the number of workers or using Ape-X.

Tuned examples: `Two-step game <https://github.com/ray-project/ray/blob/master/rllib/examples/two_step_game.py>`__

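A rough sketch of a QMIX setup is shown below. The environment name is a placeholder for a registered, grouped multi-agent env (such as the grouped two-step game from the example above), and the ``mixer`` setting assumes the QMIX config option that switches between the QMIX mixer, VDN, and independent learners.

.. code-block:: python

    from ray.rllib.algorithms.qmix import QMixConfig

    # Minimal sketch: QMIX on an env whose agents have been grouped.
    # "grouped_twostep" is a placeholder registered env name.
    config = (
        QMixConfig()
        .environment("grouped_twostep")
        .training(mixer="qmix")  # assumed options: "qmix", "vdn", or None
    )
    algo = config.build()
    algo.train()
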
**QMIX-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

.. autoclass:: ray.rllib.algorithms.qmix.qmix.QMixConfig
    :members: training

.. _maddpg:

Multi-Agent Deep Deterministic Policy Gradient (MADDPG)
-------------------------------------------------------
|tensorflow|
`[paper] <https://arxiv.org/abs/1706.02275>`__ `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/maddpg/maddpg.py>`__ MADDPG is a DDPG centralized/shared critic algorithm. Code here is adapted from https://github.com/openai/maddpg to integrate with RLlib multi-agent APIs. Please check `justinkterry/maddpg-rllib <https://github.com/jkterry1/maddpg-rllib>`__ for examples and more information. Note that the implementation here is based on OpenAI's and is intended for use with the discrete MPE environments. Also note that this method is often difficult to get working, even with environment-specific tuning applied; it is best viewed as a research tool for reproducing the results of the paper that introduced it.

Tuned examples: `Multi-Agent Particle Environment <https://github.com/wsjeon/maddpg-rllib/tree/master/plots>`__, `Two-step game <https://github.com/ray-project/ray/blob/master/rllib/examples/two_step_game.py>`__

**MADDPG-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

.. autoclass:: ray.rllib.algorithms.maddpg.maddpg.MADDPGConfig
    :members: training

.. _sc:

Shared Critic Methods
---------------------
`[instructions] <https://docs.ray.io/en/master/rllib-env.html#implementing-a-centralized-critic>`__ Shared critic methods are ones in which all agents use a single, parameter-shared critic network (in some cases with access to more of the observation space than agents can see). Note that many specialized multi-agent algorithms such as MADDPG are mostly shared-critic forms of their single-agent counterpart (DDPG in the case of MADDPG).

Tuned examples: `TwoStepGame <https://github.com/ray-project/ray/blob/master/rllib/examples/centralized_critic_2.py>`__

Others
~~~~~~

.. _alphazero:

Single-Player Alpha Zero (AlphaZero)
------------------------------------
|pytorch|
`[paper] <https://arxiv.org/abs/1712.01815>`__ `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/alpha_zero>`__ AlphaZero is an RL agent originally designed for two-player games. This version adapts it to handle single-player games. The code can be scaled to any number of workers. It also implements the ranked rewards `(R2) <https://arxiv.org/abs/1807.01672>`__ strategy to enable self-play even in the one-player setting. The code is mainly intended for combinatorial optimization tasks.

Tuned examples: `Sparse reward CartPole <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/alpha_zero/cartpole-sparse-rewards-alpha-zero.yaml>`__

**AlphaZero-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

.. autoclass:: ray.rllib.algorithms.alpha_zero.alpha_zero.AlphaZeroConfig
    :members: training

.. _leelachesszero:

MultiAgent LeelaChessZero (LeelaChessZero)
------------------------------------------
|pytorch|
`[source] <https://github.com/LeelaChessZero/lc0/>`__ `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/leela_chess_zero>`__ LeelaChessZero is an RL agent originally inspired by AlphaZero for playing chess. This version adapts it to a competitive multi-agent chess environment. The code can be scaled to any number of workers.

Tuned examples: tbd

**LeelaChessZero-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

.. autoclass:: ray.rllib.algorithms.leela_chess_zero.leela_chess_zero.LeelaChessZeroConfig
    :members: training

  604. .. _curiosity:
  605. Curiosity (ICM: Intrinsic Curiosity Module)
  606. -------------------------------------------
  607. |pytorch|
  608. `[paper] <https://arxiv.org/pdf/1705.05363.pdf>`__
  609. `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/utils/exploration/curiosity.py>`__
  610. Tuned examples:
  611. `Pyramids (Unity3D) <https://github.com/ray-project/ray/blob/master/rllib/examples/unity3d_env_local.py>`__ (use ``--env Pyramids`` command line option)
  612. `Test case with MiniGrid example <https://github.com/ray-project/ray/blob/master/rllib/utils/exploration/tests/test_curiosity.py#L184>`__ (UnitTest case: ``test_curiosity_on_partially_observable_domain``)

**Activating Curiosity**

The curiosity plugin can be activated by specifying it as the Exploration class to be used
in the main Algorithm config. Most of its parameters usually do not have to be specified,
as the module uses the values from the paper by default. For example:

.. code-block:: python

    from ray.rllib.algorithms import ppo

    config = ppo.DEFAULT_CONFIG.copy()
    config["num_workers"] = 0
    config["exploration_config"] = {
        "type": "Curiosity",  # <- Use the Curiosity module for exploring.
        "eta": 1.0,  # Weight for intrinsic rewards before being added to extrinsic ones.
        "lr": 0.001,  # Learning rate of the curiosity (ICM) module.
        "feature_dim": 288,  # Dimensionality of the generated feature vectors.
        # Setup of the feature net (used to encode observations into feature (latent) vectors).
        "feature_net_config": {
            "fcnet_hiddens": [],
            "fcnet_activation": "relu",
        },
        "inverse_net_hiddens": [256],  # Hidden layers of the "inverse" model.
        "inverse_net_activation": "relu",  # Activation of the "inverse" model.
        "forward_net_hiddens": [256],  # Hidden layers of the "forward" model.
        "forward_net_activation": "relu",  # Activation of the "forward" model.
        "beta": 0.2,  # Weight for the "forward" loss (beta) over the "inverse" loss (1.0 - beta).
        # Specify which exploration sub-type to use (usually, the algo's "default"
        # exploration, e.g. EpsilonGreedy for DQN, StochasticSampling for PG/SAC).
        "sub_exploration": {
            "type": "StochasticSampling",
        },
    }
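
For reference, roughly the same setup expressed with the newer ``AlgorithmConfig`` builder API could look like the following sketch (only the keys that deviate from the Curiosity defaults are repeated here):

.. code-block:: python

    from ray.rllib.algorithms.ppo import PPOConfig

    config = (
        PPOConfig()
        .rollouts(num_rollout_workers=0)
        .exploration(
            exploration_config={
                "type": "Curiosity",  # <- Use the Curiosity module for exploring.
                "eta": 1.0,  # Weight of the intrinsic reward.
                "lr": 0.001,  # Learning rate of the ICM module.
                # All remaining Curiosity parameters keep their paper defaults.
                "sub_exploration": {"type": "StochasticSampling"},
            }
        )
    )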

**Functionality**

RLlib's Curiosity is based on the `"ICM" (intrinsic curiosity module) described in this paper <https://arxiv.org/pdf/1705.05363.pdf>`__.
It allows agents to learn in sparse-reward or even no-reward environments by
calculating so-called "intrinsic rewards", based purely on the information content arriving through the observation channel.
Sparse-reward environments are environments in which almost all reward signals are 0.0, such as the `MiniGrid example environments <https://github.com/maximecb/gym-minigrid>`__.
In such environments, agents have to navigate (and change the underlying state of the environment) over long periods of time without receiving much (or any) feedback.
For example, the task could be to find a key in some room, pick it up, find the matching door (matching the color of the key), and eventually unlock this door with the key to reach a goal state,
all the while never seeing a reward.
Such problems are practically impossible to solve with standard exploration methods like epsilon-greedy or stochastic sampling.
The Curiosity module - when configured as the Exploration class to use via the Algorithm's config (see above on how to do this) - automatically adds three simple models to the Policy's ``self.model``:

a) a latent-space learning ("feature") model, which takes an environment observation and outputs a latent vector representing this observation,
b) a "forward" model, which predicts the next latent vector, given the current latent vector and the action to take next, and
c) a so-called "inverse" net, which is only used to train the "feature" net. It tries to predict the action taken between two consecutive latent vectors (obs and next obs).

All of these extra models are trained inside the modified ``Exploration.postprocess_trajectory()`` call.
Using the (ever-changing) "forward" model, the Curiosity module calculates an artificial (intrinsic) reward signal, weights it via the ``eta`` parameter, and adds it to the environment's (extrinsic) reward.
The intrinsic reward for each env step is the Euclidean distance between the latent-space encoding of the next observation (produced by the "feature" model) and the **predicted** latent-space encoding of that next observation (produced by the "forward" model).
This encourages the agent to explore areas of the environment that the "forward" model does not yet "understand" (predicts poorly), while exploration of these areas tapers off once the agent has visited them
often enough: the "forward" model eventually gets better at predicting the next latent vectors, which in turn diminishes the intrinsic rewards (the Euclidean distance between predicted and actual vectors shrinks).
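
The following numpy snippet sketches this intrinsic reward computation schematically (it is not RLlib's exact implementation; the ``phi`` encodings stand for outputs of the "feature" model, and ``eta`` corresponds to the config parameter described above):

.. code-block:: python

    import numpy as np

    def intrinsic_reward(phi_next, phi_next_pred, eta=1.0):
        """Curiosity bonus: how "surprised" the forward model is.

        phi_next: latent encoding of the actually observed next observation.
        phi_next_pred: the forward model's prediction of that encoding.
        """
        # Euclidean distance between predicted and actual next latent vector,
        # weighted by eta before being added to the extrinsic reward.
        return eta * np.linalg.norm(phi_next - phi_next_pred)

    # As the forward model improves, the distance - and thus the bonus - shrinks.
    print(intrinsic_reward(np.array([0.1, 0.4]), np.array([0.0, 0.0]), eta=1.0))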

.. _re3:

RE3 (Random Encoders for Efficient Exploration)
-----------------------------------------------

|tensorflow|
`[paper] <https://arxiv.org/pdf/2102.09430.pdf>`__
`[implementation] <https://github.com/ray-project/ray/blob/master/rllib/utils/exploration/random_encoder.py>`__

Examples:
`LunarLanderContinuous-v2 <https://github.com/ray-project/ray/blob/master/rllib/examples/re3_exploration.py>`__ (use ``--env LunarLanderContinuous-v2`` command line option)
`Test case with Pendulum-v1 example <https://github.com/ray-project/ray/blob/master/rllib/utils/exploration/tests/test_random_encoder.py>`__

**Activating RE3**

The RE3 plugin can be activated by specifying it as the Exploration class to be used
in the main Algorithm config and by inheriting from ``RE3UpdateCallbacks`` as shown in this `example <https://github.com/ray-project/ray/blob/c9c3f0745a9291a4de0872bdfa69e4ffdfac3657/rllib/utils/exploration/tests/test_random_encoder.py#L35>`__. Most of its parameters usually do not have to be specified, as the module uses the values from the paper by default. For example:

.. code-block:: python

    from ray.rllib.algorithms import sac

    config = sac.DEFAULT_CONFIG.copy()
    config["env"] = "Pendulum-v1"
    config["seed"] = 12345
    # `RE3Callbacks` is a callbacks class that inherits from `RE3UpdateCallbacks`
    # (see the linked example above and the sketch below).
    config["callbacks"] = RE3Callbacks
    config["exploration_config"] = {
        "type": "RE3",
        # The dimensionality of the observation embedding vectors in latent space.
        "embeds_dim": 128,
        "rho": 0.1,  # Beta decay factor, used for on-policy algorithms.
        "k_nn": 50,  # Number of neighbours to set for K-NN entropy estimation.
        # Configuration for the encoder network, producing embedding vectors from observations.
        # This can be used to configure fcnet- or conv_net setups to properly process any
        # observation space. By default uses the Policy model configuration.
        "encoder_net_config": {
            "fcnet_hiddens": [],
            "fcnet_activation": "relu",
        },
        # Hyperparameter to choose between exploration and exploitation. A higher value of beta
        # adds more importance to the intrinsic reward, as per the following equation:
        # `reward = r + beta * intrinsic_reward`
        "beta": 0.2,
        # Schedule to use for beta decay, one of "constant" or "linear_decay".
        "beta_schedule": "constant",
        # Specify which exploration sub-type to use (usually, the algo's "default"
        # exploration, e.g. EpsilonGreedy for DQN, StochasticSampling for PG/SAC).
        "sub_exploration": {
            "type": "StochasticSampling",
        },
    }
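
One possible way to obtain the ``RE3Callbacks`` class referenced above - mirroring the linked test example - is to mix RLlib's ``RE3UpdateCallbacks`` into the algorithm's callbacks. This is only a sketch; ``RE3Callbacks`` is simply the local name used in the config above.

.. code-block:: python

    from ray.rllib.algorithms.callbacks import DefaultCallbacks
    from ray.rllib.utils.exploration.random_encoder import RE3UpdateCallbacks

    class RE3Callbacks(RE3UpdateCallbacks, DefaultCallbacks):
        """Combines RE3's update callbacks with RLlib's default callbacks."""
        pass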

**Functionality**

RLlib's RE3 is based on `"Random Encoders for Efficient Exploration" described in this paper <https://arxiv.org/pdf/2102.09430.pdf>`__.
RE3 quantifies exploration based on state entropy. The entropy of a state is calculated from its distance, in latent space, to its K nearest neighbor states in the replay buffer (in this implementation, the K-NN search runs over the training samples of the same batch).
The state entropy is treated as an intrinsic reward and, for policy optimization, added to the extrinsic reward when one is available. If no extrinsic reward is available, the state entropy is used as the sole ("intrinsic") reward for unsupervised pre-training of the RL agent.
RE3 thus also allows agents to learn in sparse-reward or even no-reward environments by
using the state entropy as intrinsic reward.
This exploration objective can be used with both model-free and model-based RL algorithms.
RE3 uses a randomly initialized encoder to obtain the state's latent representation, thus avoiding the complexity of training a representation-learning method. The encoder weights stay fixed during the entire training process.
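
The numpy sketch below illustrates the idea of the K-NN based state-entropy bonus within one sample batch. It is a schematic simplification, not RLlib's exact implementation; for instance, whether the k-th neighbor's distance or an average over the k nearest neighbors is used is a detail that varies between formulations.

.. code-block:: python

    import numpy as np

    def re3_intrinsic_rewards(embeddings, k_nn=3):
        """State-entropy style bonus from random-encoder embeddings.

        embeddings: [batch_size, embed_dim] outputs of the (fixed) random encoder.
        """
        # Pairwise Euclidean distances within the batch.
        diffs = embeddings[:, None, :] - embeddings[None, :, :]
        dists = np.linalg.norm(diffs, axis=-1)
        # For each state, take its k nearest neighbors (skip distance 0.0 to itself).
        knn_dists = np.sort(dists, axis=-1)[:, 1:k_nn + 1]
        # Larger distance to neighbors -> "rarer" state -> larger intrinsic reward.
        return np.log(knn_dists.mean(axis=-1) + 1.0)

    batch = np.random.randn(8, 128)  # e.g. embeds_dim=128 as in the config above
    print(re3_intrinsic_rewards(batch, k_nn=3))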

.. _fil:

Fully Independent Learning
--------------------------

`[instructions] <rllib-env.html#multi-agent-and-hierarchical>`__ In fully independent learning, a collection of agents learn independently of one another via single-agent methods. This typically works, but can be less effective than dedicated multi-agent RL methods, since independent learners do not account for the non-stationarity that the other, simultaneously learning agents introduce into the environment.

Tuned examples: `waterworld <https://github.com/ray-project/ray/blob/master/rllib/examples/multi_agent_independent_learning.py>`__, `multiagent-cartpole <https://github.com/ray-project/ray/blob/master/rllib/examples/multi_agent_cartpole.py>`__
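
A minimal sketch of this setup with RLlib's multi-agent API (the environment name and agent IDs are placeholders; any single-agent algorithm config can be used the same way):

.. code-block:: python

    from ray.rllib.algorithms.ppo import PPOConfig

    config = (
        PPOConfig()
        .environment(env="my_multi_agent_env")  # placeholder: a registered multi-agent env
        .multi_agent(
            # One policy per agent; no parameters or experiences are shared.
            policies={"policy_0", "policy_1"},
            policy_mapping_fn=(
                lambda agent_id, episode, worker=None, **kwargs: f"policy_{agent_id}"
            ),
        )
    )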

.. |tensorflow| image:: images/tensorflow.png
    :class: inline-figure
    :width: 24

.. |pytorch| image:: images/pytorch.png
    :class: inline-figure
    :width: 24