RLlib: Industry-Grade Reinforcement Learning with TF and Torch
==============================================================

**RLlib** is an open-source library for reinforcement learning (RL), offering support for
production-level, highly distributed RL workloads, while maintaining
unified and simple APIs for a large variety of industry applications.

Whether you would like to train your agents in **multi-agent** setups,
purely from **offline** (historic) datasets, or using **externally
connected simulators**, RLlib offers simple solutions for your decision-making needs.

If you either have your problem coded (in Python) as an
`RL environment <https://docs.ray.io/en/master/rllib/rllib-env.html#configuring-environments>`_
or own lots of pre-recorded, historic behavioral data to learn from, you will be
up and running in only a few days.

RLlib is already used in production by industry leaders in many different verticals, such as
`climate control <https://www.anyscale.com/events/2021/06/23/applying-ray-and-rllib-to-real-life-industrial-use-cases>`_,
`industrial control <https://www.anyscale.com/events/2021/06/22/offline-rl-with-rllib>`_,
`manufacturing and logistics <https://www.anyscale.com/events/2022/03/29/alphadow-leveraging-rays-ecosystem-to-train-and-deploy-an-rl-industrial>`_,
`finance <https://www.anyscale.com/events/2021/06/22/a-24x-speedup-for-reinforcement-learning-with-rllib-+-ray>`_,
`gaming <https://www.anyscale.com/events/2021/06/22/using-reinforcement-learning-to-optimize-iap-offer-recommendations-in-mobile-games>`_,
`automobile <https://www.anyscale.com/events/2021/06/23/using-rllib-in-an-enterprise-scale-reinforcement-learning-solution>`_,
`robotics <https://www.anyscale.com/events/2021/06/23/introducing-amazon-sagemaker-kubeflow-reinforcement-learning-pipelines-for>`_,
`boat design <https://www.youtube.com/watch?v=cLCK13ryTpw>`_,
and many others.

You can also read about `RLlib Key Concepts <https://docs.ray.io/en/master/rllib/core-concepts.html>`_.

Installation and Setup
----------------------

Install RLlib and run your first experiment on your laptop in seconds:

**TensorFlow:**

.. code-block:: bash

    $ conda create -n rllib python=3.8
    $ conda activate rllib
    $ pip install "ray[rllib]" tensorflow "gym[atari]" "gym[accept-rom-license]" atari_py
    $ # Run a test job:
    $ rllib train --run APPO --env CartPole-v0

**PyTorch:**

.. code-block:: bash

    $ conda create -n rllib python=3.8
    $ conda activate rllib
    $ pip install "ray[rllib]" torch "gym[atari]" "gym[accept-rom-license]" atari_py
    $ # Run a test job:
    $ rllib train --run APPO --env CartPole-v0 --torch

Algorithms Supported
--------------------

Offline RL:

- `Behavior Cloning (BC; derived from MARWIL implementation) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#bc>`__
- `Conservative Q-Learning (CQL) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#cql>`__
- `Critic Regularized Regression (CRR) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#crr>`__
- `Importance Sampling and Weighted Importance Sampling (OPE) <https://docs.ray.io/en/latest/rllib/rllib-offline.html#is>`__
- `Monotonic Advantage Re-Weighted Imitation Learning (MARWIL) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#marwil>`__

Model-free On-policy RL:

- `Asynchronous Proximal Policy Optimization (APPO) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#appo>`__
- `Decentralized Distributed Proximal Policy Optimization (DD-PPO) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#ddppo>`__
- `Proximal Policy Optimization (PPO) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#ppo>`__
- `Importance Weighted Actor-Learner Architecture (IMPALA) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#impala>`__
- `Advantage Actor-Critic (A2C, A3C) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#a3c>`__
- `Vanilla Policy Gradient (PG) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#pg>`__
- `Model-agnostic Meta-Learning (MAML) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#maml>`__

Model-free Off-policy RL:

- `Distributed Prioritized Experience Replay (Ape-X DQN, Ape-X DDPG) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#apex>`__
- `Recurrent Replay Distributed DQN (R2D2) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#r2d2>`__
- `Deep Q Networks (DQN, Rainbow, Parametric DQN) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#dqn>`__
- `Deep Deterministic Policy Gradients (DDPG, TD3) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#ddpg>`__
- `Soft Actor Critic (SAC) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#sac>`__

Model-based RL:

- `Image-only Dreamer (Dreamer) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#dreamer>`__
- `Model-Based Meta-Policy-Optimization (MB-MPO) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#mbmpo>`__

Derivative-free algorithms:

- `Augmented Random Search (ARS) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#ars>`__
- `Evolution Strategies (ES) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#es>`__

RL for recommender systems:

- `SlateQ <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#slateq>`__

Bandits:

- `Linear Upper Confidence Bound (BanditLinUCB) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#lin-ucb>`__
- `Linear Thompson Sampling (BanditLinTS) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#lints>`__

Multi-agent:

- `Parameter Sharing <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#parameter>`__
- `QMIX Monotonic Value Factorisation (QMIX, VDN, IQN) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#qmix>`__
- `Multi-Agent Deep Deterministic Policy Gradient (MADDPG) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#maddpg>`__
- `Shared Critic Methods <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#sc>`__

Others:

- `Single-Player Alpha Zero (AlphaZero) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#alphazero>`__
- `Curiosity (ICM: Intrinsic Curiosity Module) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#curiosity>`__
- `Random encoders (RE3) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#re3>`__
- `Fully Independent Learning <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#fil>`__

A list of all the algorithms can be found `here <https://docs.ray.io/en/master/rllib/rllib-algorithms.html>`__.

Quick First Experiment
----------------------

.. code-block:: python

    import gymnasium as gym
    from ray.rllib.algorithms.ppo import PPOConfig


    # Define your problem using Python and Farama-Foundation's gymnasium API:
    class ParrotEnv(gym.Env):
        """Environment in which an agent must learn to repeat the seen observations.

        Observations are float numbers indicating the to-be-repeated values,
        e.g. -1.0, 5.1, or 3.2.

        The action space is always the same as the observation space.

        Rewards are r=-abs(observation - action), for all steps.
        """

        def __init__(self, config):
            # Make the space (for actions and observations) configurable.
            self.action_space = config.get(
                "parrot_shriek_range", gym.spaces.Box(-1.0, 1.0, shape=(1, )))
            # Since actions should repeat observations, their spaces must be the
            # same.
            self.observation_space = self.action_space
            self.cur_obs = None
            self.episode_len = 0

        def reset(self, *, seed=None, options=None):
            """Resets the episode and returns the initial observation of the new one.
            """
            # Reset the episode len.
            self.episode_len = 0
            # Sample a random number from our observation space.
            self.cur_obs = self.observation_space.sample()
            # Return initial observation and an (empty) info dict.
            return self.cur_obs, {}

        def step(self, action):
            """Takes a single step in the episode given `action`.

            Returns:
                New observation, reward, terminated-flag, truncated-flag,
                info-dict (empty).
            """
            # Set `truncated` flag after 10 steps.
            self.episode_len += 1
            terminated = False
            truncated = self.episode_len >= 10
            # r = -abs(obs - action)
            reward = -sum(abs(self.cur_obs - action))
            # Set a new observation (random sample).
            self.cur_obs = self.observation_space.sample()
            return self.cur_obs, reward, terminated, truncated, {}


    # Create an RLlib Algorithm instance from a PPOConfig to learn how to
    # act in the above environment.
    config = (
        PPOConfig()
        .environment(
            # Env class to use (here: our gym.Env sub-class from above).
            env=ParrotEnv,
            # Config dict to be passed to our custom env's constructor.
            env_config={
                "parrot_shriek_range": gym.spaces.Box(-5.0, 5.0, (1, ))
            },
        )
        # Parallelize environment rollouts.
        .rollouts(num_rollout_workers=3)
    )
    # Use the config's `build()` method to construct a PPO object.
    algo = config.build()

    # Train for n iterations and report results (mean episode rewards).
    # Each episode consists of 10 guesses, and a perfect guess (action equal
    # to the observation) earns a reward of 0.0, so the best achievable
    # episode reward is 0.0.
    for i in range(5):
        results = algo.train()
        print(f"Iter: {i}; avg. reward={results['episode_reward_mean']}")

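
To make the reward rule ``r = -abs(observation - action)`` concrete, here is a minimal, dependency-free sketch that does not use RLlib or gymnasium at all. The helper ``episode_return`` and the two hand-written policies are hypothetical names introduced only for this illustration. It shows why 0.0 is the best possible episode reward: a policy that repeats every observation scores exactly 0.0, while any policy that ignores the observation scores below it.

```python
import random


def episode_return(policy, low=-5.0, high=5.0, steps=10, seed=0):
    """Sum the per-step rewards r = -abs(obs - action) over one episode.

    `policy` is any callable mapping an observation (float) to an action.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(steps):
        # Draw a random observation, as ParrotEnv does via `space.sample()`.
        obs = rng.uniform(low, high)
        action = policy(obs)
        total += -abs(obs - action)
    return total


def perfect_parrot(obs):
    # Repeats the observation exactly: reward 0.0 at every step.
    return obs


def silent_parrot(obs):
    # Ignores the observation and always answers 0.0.
    return 0.0


best = episode_return(perfect_parrot)
worse = episode_return(silent_parrot)
print(f"perfect parrot: {best}; silent parrot: {worse}")
```

The mean episode rewards printed during training should therefore move from clearly negative values toward 0.0 as the agent learns.
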
After training, you may want to perform action computations (inference) in your environment.
Below is a minimal example of how to do this. Also
`check out our more detailed examples here <https://github.com/ray-project/ray/tree/master/rllib/examples/inference_and_serving>`_
(in particular for `normal models <https://github.com/ray-project/ray/blob/master/rllib/examples/inference_and_serving/policy_inference_after_training.py>`_,
`LSTMs <https://github.com/ray-project/ray/blob/master/rllib/examples/inference_and_serving/policy_inference_after_training_with_lstm.py>`_,
and `attention nets <https://github.com/ray-project/ray/blob/master/rllib/examples/inference_and_serving/policy_inference_after_training_with_attention.py>`_).

.. code-block:: python

    # Perform inference (action computations) based on given env observations.
    # Note that we are using a slightly simpler env here (-3.0 to 3.0, instead
    # of -5.0 to 5.0!), however, this should still work as the agent has
    # (hopefully) learned to "just always repeat the observation!".
    env = ParrotEnv({"parrot_shriek_range": gym.spaces.Box(-3.0, 3.0, (1, ))})
    # Get the initial observation (some value between -3.0 and 3.0).
    obs, info = env.reset()
    terminated = truncated = False
    total_reward = 0.0
    # Play one episode.
    while not terminated and not truncated:
        # Compute a single action, given the current observation
        # from the environment.
        action = algo.compute_single_action(obs)
        # Apply the computed action in the environment.
        obs, reward, terminated, truncated, info = env.step(action)
        # Sum up rewards for reporting purposes.
        total_reward += reward
    # Report results.
    print(f"Shrieked for 1 episode; total-reward={total_reward}")

For a more detailed `"60 second" example, head to our main documentation <https://docs.ray.io/en/master/rllib/index.html>`_.

Highlighted Features
--------------------

The following is a summary of RLlib's most striking features (for an in-depth overview,
check out our `documentation <http://docs.ray.io/en/master/rllib/index.html>`_):

The most **popular deep-learning frameworks**: `PyTorch <https://github.com/ray-project/ray/blob/master/rllib/examples/custom_torch_policy.py>`_ and `TensorFlow
(tf1.x/2.x static-graph/eager/traced) <https://github.com/ray-project/ray/blob/master/rllib/examples/custom_tf_policy.py>`_.

**Highly distributed learning**: Our RLlib algorithms (such as our "PPO" or "IMPALA")
allow you to set the ``num_workers`` config parameter (``num_rollout_workers`` in the
``PPOConfig().rollouts()`` call shown in the example above), such that your workloads
can run on 100s of CPUs/nodes, thus parallelizing and speeding up learning.

**Vectorized (batched) and remote (parallel) environments**: RLlib auto-vectorizes
your ``gym.Envs`` via the ``num_envs_per_worker`` config. Environment workers can
then batch and thus significantly speed up the action-computing forward pass.
On top of that, RLlib offers the ``remote_worker_envs`` config to create
`single environments (within a vectorized one) as Ray actors <https://github.com/ray-project/ray/blob/master/rllib/examples/remote_base_env_with_custom_api.py>`_,
thus parallelizing even the env stepping process.

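
The benefit of vectorization can be sketched without RLlib: when one worker holds several environment copies, it can compute the actions for all of them in a single batched call instead of one call per environment. The classes ``ToyEnv`` and ``CountingPolicy`` and the function ``rollout_vectorized`` below are hypothetical stand-ins written for this illustration, not RLlib's implementation.

```python
class ToyEnv:
    """Minimal stand-in environment: the observation is a float counter."""

    def __init__(self, start):
        self.state = start

    def reset(self):
        return self.state

    def step(self, action):
        # Same reward rule as the ParrotEnv example: r = -abs(obs - action).
        reward = -abs(self.state - action)
        self.state += 1.0
        return self.state, reward


class CountingPolicy:
    """Toy policy that counts how many forward passes it performs."""

    def __init__(self):
        self.forward_passes = 0

    def compute_actions(self, obs_batch):
        # A single call handles the whole batch of observations.
        self.forward_passes += 1
        return list(obs_batch)  # "parrot" every observation back


def rollout_vectorized(envs, policy, num_steps):
    """Step all env copies in lockstep, with one policy call per step."""
    obs_batch = [env.reset() for env in envs]
    total_reward = 0.0
    for _ in range(num_steps):
        actions = policy.compute_actions(obs_batch)
        results = [env.step(a) for env, a in zip(envs, actions)]
        obs_batch = [obs for obs, _ in results]
        total_reward += sum(reward for _, reward in results)
    return total_reward


policy = CountingPolicy()
envs = [ToyEnv(float(i)) for i in range(4)]  # 4 env copies in one worker
total = rollout_vectorized(envs, policy, num_steps=5)
print(f"env steps: {len(envs) * 5}; forward passes: {policy.forward_passes}")
```

Here 20 environment steps cost only 5 policy calls. With a neural network policy, each of those calls is one batched forward pass, which is where the speedup comes from.
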
| **Multi-agent RL** (MARL): Convert your (custom) ``gym.Envs`` into a multi-agent one
  via a few simple steps and start training your agents in any of the following fashions:
| 1) Cooperative with `shared <https://github.com/ray-project/ray/blob/master/rllib/examples/centralized_critic.py>`_ or
  `separate <https://github.com/ray-project/ray/blob/master/rllib/examples/two_step_game.py>`_
  policies and/or value functions.
| 2) Adversarial scenarios using `self-play <https://github.com/ray-project/ray/blob/master/rllib/examples/self_play_with_open_spiel.py>`_
  and `league-based training <https://github.com/ray-project/ray/blob/master/rllib/examples/self_play_league_based_with_open_spiel.py>`_.
| 3) `Independent learning <https://github.com/ray-project/ray/blob/master/rllib/examples/multi_agent_independent_learning.py>`_
  of neutral/co-existing agents.

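
As a rough sketch of what "converting to a multi-agent env" involves, the class below turns the parrot task into a two-agent version. It follows the dict-per-agent convention of RLlib's ``MultiAgentEnv`` (observations, actions, and rewards keyed by agent ID, plus an ``"__all__"`` key in the terminated/truncated dicts). It is deliberately a plain Python class so that it runs without Ray installed; in real code you would subclass ``ray.rllib.env.multi_agent_env.MultiAgentEnv`` and declare proper spaces. The name ``TwoParrotsEnv`` and the agent IDs are assumptions made for this example.

```python
import random


class TwoParrotsEnv:
    """Two agents, each of which must repeat its own observation."""

    def __init__(self, episode_len=10, seed=0):
        self.agent_ids = ["parrot_0", "parrot_1"]
        self.episode_len = episode_len
        self.rng = random.Random(seed)
        self.cur_obs = {}
        self.t = 0

    def _sample_obs(self):
        # One independent observation per agent, keyed by agent ID.
        return {aid: self.rng.uniform(-1.0, 1.0) for aid in self.agent_ids}

    def reset(self):
        self.t = 0
        self.cur_obs = self._sample_obs()
        return self.cur_obs, {}

    def step(self, action_dict):
        self.t += 1
        # Each agent is rewarded for repeating its own observation.
        rewards = {
            aid: -abs(self.cur_obs[aid] - action)
            for aid, action in action_dict.items()
        }
        self.cur_obs = self._sample_obs()
        # "__all__" signals whether the episode is over for every agent.
        terminateds = {"__all__": False}
        truncateds = {"__all__": self.t >= self.episode_len}
        return self.cur_obs, rewards, terminateds, truncateds, {}


env = TwoParrotsEnv()
obs, infos = env.reset()
returns = {aid: 0.0 for aid in env.agent_ids}
done = False
while not done:
    # Hand-written "perfect" policies: every agent repeats its observation.
    actions = {aid: agent_obs for aid, agent_obs in obs.items()}
    obs, rewards, terminateds, truncateds, infos = env.step(actions)
    for aid, reward in rewards.items():
        returns[aid] += reward
    done = terminateds["__all__"] or truncateds["__all__"]
print(returns)
```

Whether the agents then share one policy, use separate policies, or learn independently is decided in the algorithm's multi-agent configuration, not in the environment itself.
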
**External simulators**: Don't have your simulation running as a ``gym.Env`` in Python?
No problem! RLlib supports an external environment API and comes with a pluggable,
off-the-shelf
`client <https://github.com/ray-project/ray/blob/master/rllib/examples/serving/cartpole_client.py>`_/
`server <https://github.com/ray-project/ray/blob/master/rllib/examples/serving/cartpole_server.py>`_
setup that allows you to run 100s of independent simulators on the "outside"
(e.g. a Windows cloud) connecting to a central RLlib Policy-Server that learns
and serves actions. Alternatively, actions can be computed on the client side
to save on network traffic.

**Offline RL and imitation learning/behavior cloning**: You don't have a simulator
for your particular problem, but tons of historic data recorded by a legacy (maybe
non-RL/ML) system? This branch of reinforcement learning is for you!
RLlib comes with several `offline RL <https://github.com/ray-project/ray/blob/master/rllib/examples/offline_rl.py>`_
algorithms (*CQL*, *MARWIL*, and *DQfD*), allowing you to either purely
`behavior-clone <https://github.com/ray-project/ray/blob/master/rllib/algorithms/bc/tests/test_bc.py>`_
your existing system or learn how to further improve over it.

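
The core idea of behavior cloning can be shown in a few lines of plain Python: treat the logged (observation, action) pairs as a supervised dataset and fit a policy that reproduces the recorded actions. The sketch below is not RLlib's BC algorithm (which trains neural network policies); it fits a one-parameter linear policy in closed form. The ``legacy_controller`` and its gain of 0.8 are made up for illustration.

```python
import random


def legacy_controller(obs):
    # Hypothetical non-ML system whose behavior we want to clone.
    return 0.8 * obs


def fit_linear_policy(dataset):
    """Least-squares fit of `action = w * obs` to logged (obs, action) pairs."""
    numerator = sum(obs * action for obs, action in dataset)
    denominator = sum(obs * obs for obs, _ in dataset)
    return numerator / denominator


# "Historic data": observations and the actions the legacy system took.
rng = random.Random(0)
observations = [rng.uniform(-5.0, 5.0) for _ in range(100)]
logged_data = [(obs, legacy_controller(obs)) for obs in observations]

# Behavior cloning needs no simulator and no rewards, only the log.
w = fit_linear_policy(logged_data)


def cloned_policy(obs):
    return w * obs


print(f"learned gain: {w:.3f}")
```

Algorithms such as CQL or MARWIL go one step further and also use the recorded rewards, which is what allows them to improve over the logged behavior rather than merely copy it.
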
In-Depth Documentation
----------------------

For an in-depth overview of RLlib and everything it has to offer, including
hands-on tutorials of important industry use cases and workflows, head over to
our `documentation pages <https://docs.ray.io/en/master/rllib/index.html>`_.

Cite our Paper
--------------

If you've found RLlib useful for your research, please cite our `paper <https://arxiv.org/abs/1712.09381>`_ as follows:

.. code-block::

    @inproceedings{liang2018rllib,
        Author = {Eric Liang and
                  Richard Liaw and
                  Robert Nishihara and
                  Philipp Moritz and
                  Roy Fox and
                  Ken Goldberg and
                  Joseph E. Gonzalez and
                  Michael I. Jordan and
                  Ion Stoica},
        Title = {{RLlib}: Abstractions for Distributed Reinforcement Learning},
        Booktitle = {International Conference on Machine Learning ({ICML})},
        Year = {2018}
    }