RLlib: Industry-Grade Reinforcement Learning with TF and Torch
================================================================

**RLlib** is an open-source library for reinforcement learning (RL), offering support for
production-level, highly distributed RL workloads, while maintaining
unified and simple APIs for a large variety of industry applications.
Whether you would like to train your agents in **multi-agent** setups,
purely from **offline** (historic) datasets, or using **externally
connected simulators**, RLlib offers simple solutions for your decision-making needs.

If you either have your problem coded (in Python) as an
`RL environment <https://docs.ray.io/en/master/rllib/rllib-env.html#configuring-environments>`_
or own lots of pre-recorded, historic behavioral data to learn from, you will be
up and running in only a few days.

RLlib is already used in production by industry leaders in many different verticals, such as
`climate control <https://www.anyscale.com/events/2021/06/23/applying-ray-and-rllib-to-real-life-industrial-use-cases>`_,
`industrial control <https://www.anyscale.com/events/2021/06/22/offline-rl-with-rllib>`_,
`manufacturing and logistics <https://www.anyscale.com/events/2022/03/29/alphadow-leveraging-rays-ecosystem-to-train-and-deploy-an-rl-industrial>`_,
`finance <https://www.anyscale.com/events/2021/06/22/a-24x-speedup-for-reinforcement-learning-with-rllib-+-ray>`_,
`gaming <https://www.anyscale.com/events/2021/06/22/using-reinforcement-learning-to-optimize-iap-offer-recommendations-in-mobile-games>`_,
`automobile <https://www.anyscale.com/events/2021/06/23/using-rllib-in-an-enterprise-scale-reinforcement-learning-solution>`_,
`robotics <https://www.anyscale.com/events/2021/06/23/introducing-amazon-sagemaker-kubeflow-reinforcement-learning-pipelines-for>`_,
`boat design <https://www.youtube.com/watch?v=cLCK13ryTpw>`_,
and many others.

You can also read about `RLlib's key concepts <https://docs.ray.io/en/master/rllib/core-concepts.html>`_.

Installation and Setup
----------------------

Install RLlib and run your first experiment on your laptop in seconds:

**PyTorch:**

.. code-block:: bash

    $ conda create -n rllib python=3.11
    $ conda activate rllib
    $ pip install "ray[rllib]" torch "gymnasium[atari]" "gymnasium[accept-rom-license]" atari_py
    $ # Run a test job (assuming you are in the `ray` pip-installed directory):
    $ cd rllib/examples/inference/
    $ python policy_inference_after_training.py --stop-reward=100.0
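
As a quick sanity check, a minimal sketch like the following (assuming the install above
succeeded and using the built-in ``CartPole-v1`` task) builds a PPO algorithm and runs a
single training iteration:

.. code-block:: python

    from ray.rllib.algorithms.ppo import PPOConfig

    # Minimal smoke test: build PPO on CartPole-v1 and train for one iteration.
    config = PPOConfig().environment("CartPole-v1")
    algo = config.build()
    print(algo.train())  # Prints the result dict of the first iteration.
    algo.stop()  # Shut down the algorithm's workers again.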

Algorithms Supported
----------------------

Model-free On-policy RL:

- `Asynchronous Proximal Policy Optimization (APPO) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#appo>`__
- `Proximal Policy Optimization (PPO) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#ppo>`__
- `Importance Weighted Actor-Learner Architecture (IMPALA) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#impala>`__

Model-free Off-policy RL:

- `Deep Q Networks (DQN, Rainbow, Parametric DQN) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#dqn>`__
- `Soft Actor Critic (SAC) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#sac>`__

Model-based RL:

- `DreamerV3 <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#dreamerv3>`__

Offline RL:

- `Behavior Cloning (BC; derived from MARWIL implementation) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#bc>`__
- `Conservative Q-Learning (CQL) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#cql>`__
- `Monotonic Advantage Re-Weighted Imitation Learning (MARWIL) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#marwil>`__

Multi-agent:

- `Parameter Sharing <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#parameter>`__
- `Shared Critic Methods <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#sc>`__

Others:

- `Fully Independent Learning <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#fil>`__

A list of all the algorithms can be found `here <https://docs.ray.io/en/master/rllib/rllib-algorithms.html>`__.
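
All of these algorithms are configured through the same ``AlgorithmConfig`` pattern, so
switching between them mostly means importing a different config class. A short sketch,
using SAC on the built-in ``Pendulum-v1`` task as an example (the hyperparameter values
are placeholders):

.. code-block:: python

    from ray.rllib.algorithms.sac import SACConfig

    # Same fluent config API as PPO, just a different algorithm class.
    config = (
        SACConfig()
        .environment("Pendulum-v1")
        .training(gamma=0.99, train_batch_size=256)
    )
    algo = config.build()
    print(algo.train())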

Quick First Experiment
----------------------

.. testcode::

    import gymnasium as gym
    from ray.rllib.algorithms.ppo import PPOConfig


    # Define your problem using Python and Farama-Foundation's gymnasium API:
    class ParrotEnv(gym.Env):
        """Environment in which an agent must learn to repeat the seen observations.

        Observations are float numbers indicating the to-be-repeated values,
        e.g. -1.0, 5.1, or 3.2.
        The action space is always the same as the observation space.
        Rewards are r=-abs(observation - action), for all steps.
        """

        def __init__(self, config):
            # Make the space (for actions and observations) configurable.
            self.action_space = config.get(
                "parrot_shriek_range", gym.spaces.Box(-1.0, 1.0, shape=(1, )))
            # Since actions should repeat observations, their spaces must be the
            # same.
            self.observation_space = self.action_space
            self.cur_obs = None
            self.episode_len = 0

        def reset(self, *, seed=None, options=None):
            """Resets the episode and returns the initial observation of the new one."""
            # Reset the episode len.
            self.episode_len = 0
            # Sample a random number from our observation space.
            self.cur_obs = self.observation_space.sample()
            # Return initial observation.
            return self.cur_obs, {}

        def step(self, action):
            """Takes a single step in the episode given `action`.

            Returns:
                New observation, reward, terminated- and truncated-flags, info-dict (empty).
            """
            # Set `truncated` flag after 10 steps.
            self.episode_len += 1
            terminated = False
            truncated = self.episode_len >= 10
            # r = -abs(obs - action)
            reward = -sum(abs(self.cur_obs - action))
            # Set a new observation (random sample).
            self.cur_obs = self.observation_space.sample()
            return self.cur_obs, reward, terminated, truncated, {}


    # Create an RLlib Algorithm instance from a PPOConfig to learn how to
    # act in the above environment.
    config = (
        PPOConfig()
        .environment(
            # Env class to use (here: our gym.Env sub-class from above).
            env=ParrotEnv,
            # Config dict to be passed to our custom env's constructor.
            env_config={
                "parrot_shriek_range": gym.spaces.Box(-5.0, 5.0, (1, ))
            },
        )
        # Parallelize environment sampling.
        .env_runners(num_env_runners=3)
    )
    # Use the config's `build()` method to construct a PPO object.
    algo = config.build()

    # Train for n iterations and report results (mean episode returns).
    # Since we have to guess 10 times per episode and the optimal reward is 0.0
    # (exact match between observation and action value),
    # we can expect to reach an optimal episode return of 0.0.
    for i in range(1):
        results = algo.train()
        print(f"Iter: {i}; avg. return={results['env_runners']['episode_return_mean']}")

.. testoutput::
    :options: +MOCK

    Iter: 0; avg. return=-41.88662799871655

After training, you may want to perform action computations (inference) in your environment.
Below is a minimal example of how to do this. Also
`check out our more detailed examples here <https://github.com/ray-project/ray/tree/master/rllib/examples/inference_and_serving>`_
(in particular for `normal models <https://github.com/ray-project/ray/blob/master/rllib/examples/inference_and_serving/policy_inference_after_training.py>`_,
`LSTMs <https://github.com/ray-project/ray/blob/master/rllib/examples/inference_and_serving/policy_inference_after_training_with_lstm.py>`_,
and `attention nets <https://github.com/ray-project/ray/blob/master/rllib/examples/inference_and_serving/policy_inference_after_training_with_attention.py>`_).

.. testcode::

    # Perform inference (action computations) based on given env observations.
    # Note that we are using a slightly simpler env here (-3.0 to 3.0, instead
    # of -5.0 to 5.0!), however, this should still work as the agent has
    # (hopefully) learned to "just always repeat the observation!".
    env = ParrotEnv({"parrot_shriek_range": gym.spaces.Box(-3.0, 3.0, (1, ))})
    # Get the initial observation (some value between -3.0 and 3.0).
    obs, info = env.reset()
    terminated = truncated = False
    total_reward = 0.0
    # Play one episode.
    while not terminated and not truncated:
        # Compute a single action, given the current observation
        # from the environment.
        action = algo.compute_single_action(obs)
        # Apply the computed action in the environment.
        obs, reward, terminated, truncated, info = env.step(action)
        # Sum up rewards for reporting purposes.
        total_reward += reward
    # Report results.
    print(f"Shrieked for 1 episode; total-reward={total_reward}")

.. testoutput::
    :options: +MOCK

    Shrieked for 1 episode; total-reward=-0.001

For a more detailed `"60 second" example, head to our main documentation <https://docs.ray.io/en/master/rllib/index.html>`_.

Highlighted Features
--------------------

The following is a summary of RLlib's most striking features (for an in-depth overview,
check out our `documentation <http://docs.ray.io/en/master/rllib/index.html>`_):

The most **popular deep-learning frameworks**: `PyTorch <https://github.com/ray-project/ray/blob/master/rllib/examples/custom_torch_policy.py>`_ and `TensorFlow
(tf1.x/2.x static-graph/eager/traced) <https://github.com/ray-project/ray/blob/master/rllib/examples/custom_tf_policy.py>`_.
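
The framework is chosen per algorithm config; a brief sketch (reusing ``PPOConfig`` and
the built-in ``CartPole-v1`` task):

.. code-block:: python

    from ray.rllib.algorithms.ppo import PPOConfig

    # Pick the deep-learning framework per algorithm config: "torch" or "tf2".
    config = PPOConfig().environment("CartPole-v1").framework("torch")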

**Highly distributed learning**: Our RLlib algorithms (such as our "PPO" or "IMPALA")
allow you to set the ``num_env_runners`` config parameter, such that your workloads can run
on hundreds of CPUs/nodes, thus parallelizing and speeding up learning.
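
For example, a sketch of a scaled-out sampling setup (assuming your Ray cluster has
enough CPUs to actually schedule this many EnvRunner workers):

.. code-block:: python

    from ray.rllib.algorithms.ppo import PPOConfig

    # Ask for 100 parallel EnvRunner actors to collect samples.
    config = (
        PPOConfig()
        .environment("CartPole-v1")
        .env_runners(num_env_runners=100)
    )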

**Vectorized (batched) and remote (parallel) environments**: RLlib auto-vectorizes
your ``gym.Envs`` via the ``num_envs_per_env_runner`` config. Environment workers can
then batch and thus significantly speed up the action-computing forward pass.
On top of that, RLlib offers the ``remote_worker_envs`` config to create
`single environments (within a vectorized one) as ray Actors <https://github.com/ray-project/ray/blob/master/rllib/examples/remote_base_env_with_custom_api.py>`_,
thus parallelizing even the env stepping process.
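
A sketch that combines both settings (the concrete numbers are placeholders):

.. code-block:: python

    from ray.rllib.algorithms.ppo import PPOConfig

    config = (
        PPOConfig()
        .environment("CartPole-v1")
        .env_runners(
            num_env_runners=4,
            # Each EnvRunner steps a vector of 8 env copies (batched forward passes).
            num_envs_per_env_runner=8,
            # Turn each env copy into its own Ray actor, stepped in parallel.
            remote_worker_envs=True,
        )
    )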

| **Multi-agent RL** (MARL): Convert your (custom) ``gym.Envs`` into multi-agent ones
  via a few simple steps and start training your agents in any of the following fashions
  (a rough config sketch follows the list below):
| 1) Cooperative with `shared <https://github.com/ray-project/ray/blob/master/rllib/examples/centralized_critic.py>`_ or
  `separate <https://github.com/ray-project/ray/blob/master/rllib/examples/two_step_game.py>`_
  policies and/or value functions.
| 2) Adversarial scenarios using `self-play <https://github.com/ray-project/ray/blob/master/rllib/examples/self_play_with_open_spiel.py>`_
  and `league-based training <https://github.com/ray-project/ray/blob/master/rllib/examples/self_play_league_based_with_open_spiel.py>`_.
| 3) `Independent learning <https://github.com/ray-project/ray/blob/master/rllib/examples/multi_agent_independent_learning.py>`_
  of neutral/co-existing agents.
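
A rough sketch of the multi-agent config surface only (``MyMultiAgentEnv`` is a
placeholder for your own ``MultiAgentEnv`` subclass, and the agent/policy IDs are made
up for illustration):

.. code-block:: python

    from ray.rllib.algorithms.ppo import PPOConfig

    config = (
        PPOConfig()
        # Placeholder: your own MultiAgentEnv subclass.
        .environment(MyMultiAgentEnv)
        .multi_agent(
            # Two separate policies ...
            policies={"policy_a", "policy_b"},
            # ... and a function mapping each agent ID to one of them.
            policy_mapping_fn=lambda agent_id, episode, **kwargs: (
                "policy_a" if agent_id == "agent_0" else "policy_b"
            ),
        )
    )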

**External simulators**: Don't have your simulation running as a gym.Env in Python?
No problem! RLlib supports an external environment API and comes with a pluggable,
off-the-shelf
`client <https://github.com/ray-project/ray/blob/master/rllib/examples/envs/external_envs/cartpole_client.py>`_/
`server <https://github.com/ray-project/ray/blob/master/rllib/examples/envs/external_envs/cartpole_server.py>`_
setup that allows you to run 100s of independent simulators on the "outside"
(e.g. a Windows cloud) connecting to a central RLlib Policy-Server that learns
and serves actions. Alternatively, actions can be computed on the client side
to save on network traffic.
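
On the client side, the loop looks roughly like this sketch (mirroring the linked
``cartpole_client.py`` example; it assumes a policy server is already listening on
``localhost:9900``):

.. code-block:: python

    import gymnasium as gym
    from ray.rllib.env.policy_client import PolicyClient

    # Connect this (external) simulator process to a running RLlib policy server.
    client = PolicyClient("http://localhost:9900", inference_mode="remote")

    env = gym.make("CartPole-v1")
    obs, info = env.reset()
    episode_id = client.start_episode(training_enabled=True)

    terminated = truncated = False
    while not terminated and not truncated:
        # Ask the server for an action, step the local simulator, report the reward.
        action = client.get_action(episode_id, obs)
        obs, reward, terminated, truncated, info = env.step(action)
        client.log_returns(episode_id, reward)
    client.end_episode(episode_id, obs)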

**Offline RL and imitation learning/behavior cloning**: You don't have a simulator
for your particular problem, but tons of historic data recorded by a legacy (maybe
non-RL/ML) system? This branch of reinforcement learning is for you!
RLlib comes with several `offline RL <https://github.com/ray-project/ray/blob/master/rllib/examples/offline_rl.py>`_
algorithms (*CQL*, *MARWIL*, and *DQfD*), allowing you to either purely
`behavior-clone <https://github.com/ray-project/ray/blob/master/rllib/algorithms/bc/tests/test_bc.py>`_
your existing system or learn how to further improve over it.
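
Config-wise, a behavior-cloning setup could look like the following sketch (the data
path is a placeholder for your own recorded episodes):

.. code-block:: python

    from ray.rllib.algorithms.bc import BCConfig

    # Behavior-clone a policy purely from pre-recorded data; no simulator needed.
    config = (
        BCConfig()
        # The env here mainly provides the observation/action spaces.
        .environment("CartPole-v1")
        .offline_data(input_="/path/to/your/recorded_episodes")
    )
    algo = config.build()
    print(algo.train())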

In-Depth Documentation
----------------------

For an in-depth overview of RLlib and everything it has to offer, including
hands-on tutorials of important industry use cases and workflows, head over to
our `documentation pages <https://docs.ray.io/en/master/rllib/index.html>`_.

Cite our Paper
--------------

If you've found RLlib useful for your research, please cite our `paper <https://arxiv.org/abs/1712.09381>`_ as follows:

.. code-block::

    @inproceedings{liang2018rllib,
        Author = {Eric Liang and
                  Richard Liaw and
                  Robert Nishihara and
                  Philipp Moritz and
                  Roy Fox and
                  Ken Goldberg and
                  Joseph E. Gonzalez and
                  Michael I. Jordan and
                  Ion Stoica},
        Title = {{RLlib}: Abstractions for Distributed Reinforcement Learning},
        Booktitle = {International Conference on Machine Learning ({ICML})},
        Year = {2018}
    }