RLlib: Industry-Grade Reinforcement Learning with TF and Torch
==============================================================

**RLlib** is an open-source library for reinforcement learning (RL), offering support for
production-level, highly distributed RL workloads, while maintaining
unified and simple APIs for a large variety of industry applications.

Whether you would like to train your agents in **multi-agent** setups,
purely from **offline** (historic) datasets, or using **externally
connected simulators**, RLlib offers simple solutions for your decision-making needs.

If you either have your problem coded (in Python) as an
`RL environment <>`_
or own lots of pre-recorded, historic behavioral data to learn from, you will be
up and running in only a few days.

RLlib is already used in production by industry leaders in many different verticals, such as
`climate control <>`_,
`industrial control <>`_,
`manufacturing and logistics <>`_,
`finance <>`_,
`gaming <>`_,
`automobile <>`_,
`robotics <>`_,
`boat design <>`_,
and many others.

You can also read about `RLlib Key Concepts <>`_.


Installation and Setup
----------------------

Install RLlib and run your first experiment on your laptop in seconds:

**PyTorch:**

.. code-block:: bash

    $ conda create -n rllib python=3.11
    $ conda activate rllib
    $ pip install "ray[rllib]" torch "gymnasium[atari]" "gymnasium[accept-rom-license]" atari_py
    $ # Run a test job (assuming you are in the `ray` pip-installed directory):
    $ cd rllib/examples/inference/
    $ python --stop-reward=100.0


Algorithms Supported
----------------------

Model-free On-policy RL:

- `Asynchronous Proximal Policy Optimization (APPO) <>`__
- `Proximal Policy Optimization (PPO) <>`__
- `Importance Weighted Actor-Learner Architecture (IMPALA) <>`__

Model-free Off-policy RL:

- `Deep Q Networks (DQN, Rainbow, Parametric DQN) <>`__
- `Soft Actor Critic (SAC) <>`__

Model-based RL:

- `DreamerV3 <>`__

Offline RL:

- `Behavior Cloning (BC; derived from MARWIL implementation) <>`__
- `Conservative Q-Learning (CQL) <>`__
- `Monotonic Advantage Re-Weighted Imitation Learning (MARWIL) <>`__

Multi-agent:

- `Parameter Sharing <>`__
- `Shared Critic Methods <>`__

Others:

- `Fully Independent Learning <>`__

A list of all the algorithms can be found `here <>`__.


Quick First Experiment
----------------------

.. testcode::

    import gymnasium as gym
    from ray.rllib.algorithms.ppo import PPOConfig

    # Define your problem using Python and Farama-Foundation's gymnasium API:
    class ParrotEnv(gym.Env):
        """Environment in which an agent must learn to repeat the seen observations.

        Observations are float numbers indicating the to-be-repeated values,
        e.g. -1.0, 5.1, or 3.2.

        The action space is always the same as the observation space.

        Rewards are r=-abs(observation - action), for all steps.
        """

        def __init__(self, config):
            # Make the space (for actions and observations) configurable.
            self.action_space = config.get(
                "parrot_shriek_range", gym.spaces.Box(-1.0, 1.0, shape=(1, )))
            # Since actions should repeat observations, their spaces must be the
            # same.
            self.observation_space = self.action_space
            self.cur_obs = None
            self.episode_len = 0

        def reset(self, *, seed=None, options=None):
            """Resets the episode and returns the initial observation of the new one.
            """
            # Reset the episode len.
            self.episode_len = 0
            # Sample a random number from our observation space.
            self.cur_obs = self.observation_space.sample()
            # Return initial observation.
            return self.cur_obs, {}

        def step(self, action):
            """Takes a single step in the episode given `action`.

            Returns:
                New observation, reward, terminated-flag, truncated-flag,
                info-dict (empty).
            """
            # Set `truncated` flag after 10 steps.
            self.episode_len += 1
            terminated = False
            truncated = self.episode_len >= 10
            # r = -abs(obs - action)
            reward = -sum(abs(self.cur_obs - action))
            # Set a new observation (random sample).
            self.cur_obs = self.observation_space.sample()
            return self.cur_obs, reward, terminated, truncated, {}

    # Create an RLlib Algorithm instance from a PPOConfig to learn how to
    # act in the above environment.
    config = (
        PPOConfig()
        .environment(
            # Env class to use (here: our gym.Env sub-class from above).
            env=ParrotEnv,
            # Config dict to be passed to our custom env's constructor.
            env_config={
                "parrot_shriek_range": gym.spaces.Box(-5.0, 5.0, (1, ))
            },
        )
        # Parallelize environment sampling.
        .env_runners(num_env_runners=3)
    )
    # Use the config's `build()` method to construct a PPO object.
    algo = config.build()

    # Train for n iterations and report results (mean episode returns).
    # Since we have to guess 10 times and the optimal reward is 0.0
    # (exact match between observation and action value),
    # the best achievable episode return is 0.0.
    for i in range(1):
        results = algo.train()
        print(f"Iter: {i}; avg. return={results['env_runners']['episode_return_mean']}")

.. testoutput::
    :options: +MOCK

    Iter: 0; avg. return=-41.88662799871655

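To put the reported return in context, the sketch below reproduces the reward arithmetic of ``ParrotEnv`` in plain Python, without RLlib or gymnasium. It compares a perfect parrot with a baseline that guesses uniformly at random. The helper names (``episode_return``, ``perfect_policy``, ``random_policy``) are illustrative and not part of RLlib, and an untrained PPO policy is not a uniform sampler, so its first-iteration return will differ from this baseline.

```python
import random

LOW, HIGH = -5.0, 5.0   # same range as "parrot_shriek_range" in the example
EPISODE_LEN = 10        # the env truncates after 10 steps


def episode_return(policy, rng):
    """Plays one episode and returns the sum of r = -abs(obs - action)."""
    total = 0.0
    for _ in range(EPISODE_LEN):
        obs = rng.uniform(LOW, HIGH)
        action = policy(obs, rng)
        total += -abs(obs - action)
    return total


def perfect_policy(obs, rng):
    # Repeats the observation exactly, so every step reward is 0.
    return obs


def random_policy(obs, rng):
    # Ignores the observation and guesses uniformly in the same range.
    return rng.uniform(LOW, HIGH)


rng = random.Random(0)
perfect = episode_return(perfect_policy, rng)
n = 2000
random_mean = sum(episode_return(random_policy, rng) for _ in range(n)) / n

print(f"perfect parrot return: {perfect}")
print(f"uniform-random baseline, mean return: {random_mean:.2f}")
```

For two independent uniform draws on an interval of width 10, the expected absolute difference is 10/3, so the random baseline averages roughly -33 per 10-step episode, while the optimum is exactly 0.0.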
After training, you may want to perform action computations (inference) in your environment.
Below is a minimal example of how to do this. Also
`check out our more detailed examples here <>`_
(in particular for `normal models <>`_,
`LSTMs <>`_,
and `attention nets <>`_).

.. testcode::

    # Perform inference (action computations) based on given env observations.
    # Note that we are using a slightly simpler env here (-3.0 to 3.0, instead
    # of -5.0 to 5.0!), however, this should still work as the agent has
    # (hopefully) learned to "just always repeat the observation!".
    env = ParrotEnv({"parrot_shriek_range": gym.spaces.Box(-3.0, 3.0, (1, ))})
    # Get the initial observation (some value between -3.0 and 3.0).
    obs, info = env.reset()
    terminated = truncated = False
    total_reward = 0.0
    # Play one episode.
    while not terminated and not truncated:
        # Compute a single action, given the current observation
        # from the environment.
        action = algo.compute_single_action(obs)
        # Apply the computed action in the environment.
        obs, reward, terminated, truncated, info = env.step(action)
        # Sum up rewards for reporting purposes.
        total_reward += reward
    # Report results.
    print(f"Shrieked for 1 episode; total-reward={total_reward}")

.. testoutput::
    :options: +MOCK

    Shrieked for 1 episode; total-reward=-0.001

For a more detailed `"60 second" example, head to our main documentation <>`_.


Highlighted Features
--------------------

The following is a summary of RLlib's most striking features (for an in-depth overview,
check out our `documentation <>`_):

The most **popular deep-learning frameworks**: `PyTorch <>`_ and `TensorFlow
(tf1.x/2.x static-graph/eager/traced) <>`_.

**Highly distributed learning**: RLlib algorithms (such as "PPO" or "IMPALA")
let you set the ``num_env_runners`` config parameter, so that your workloads can run
on hundreds of CPUs or nodes, parallelizing and speeding up learning.

**Vectorized (batched) and remote (parallel) environments**: RLlib auto-vectorizes
your ``gym.Envs`` via the ``num_envs_per_env_runner`` config. Environment workers can
then batch observations and thus significantly speed up the action-computing forward pass.
On top of that, RLlib offers the ``remote_worker_envs`` config to create
`single environments (within a vectorized one) as Ray actors <>`_,
thus parallelizing even the env stepping process.

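The idea behind vectorization can be sketched in plain Python: several copies of an environment are stepped in lockstep, and the policy is called once per step on the whole batch of observations instead of once per environment. This is a conceptual illustration only, not RLlib's implementation; ``TinyEnv``, ``CountingPolicy``, and ``rollout_vectorized`` are made-up names.

```python
import random


class TinyEnv:
    """Minimal stand-in environment: observations are random floats."""

    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.rng.uniform(-1.0, 1.0)

    def step(self, action):
        self.t += 1
        obs = self.rng.uniform(-1.0, 1.0)
        done = self.t >= 10
        return obs, done


class CountingPolicy:
    """Echoes observations and counts how often it is invoked."""

    def __init__(self):
        self.forward_passes = 0

    def compute_actions(self, obs_batch):
        # One call handles the whole batch (a single "forward pass").
        self.forward_passes += 1
        return [obs for obs in obs_batch]


def rollout_vectorized(envs, policy):
    """Steps all envs in lockstep with one policy call per step."""
    obs_batch = [env.reset() for env in envs]
    done = False
    steps = 0
    while not done:
        actions = policy.compute_actions(obs_batch)
        results = [env.step(a) for env, a in zip(envs, actions)]
        obs_batch = [obs for obs, _ in results]
        # All copies share the same horizon, so they finish together.
        done = all(d for _, d in results)
        steps += len(envs)
    return steps


num_envs = 8  # plays the role of num_envs_per_env_runner
policy = CountingPolicy()
env_steps = rollout_vectorized([TinyEnv(seed=i) for i in range(num_envs)], policy)
print(f"{env_steps} env steps with {policy.forward_passes} policy calls")
```

With eight copies and a 10-step horizon, the sketch collects 80 environment steps using only 10 policy calls. With a neural-network policy, the saving comes from replacing many single-observation forward passes with a few batched ones.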
| **Multi-agent RL** (MARL): Convert your (custom) ``gym.Env`` into a multi-agent one
  in a few simple steps and start training your agents in any of the following fashions:
| 1) Cooperative with `shared <>`_ or
  `separate <>`_
  policies and/or value functions.
| 2) Adversarial scenarios using `self-play <>`_
  and `league-based training <>`_.
| 3) `Independent learning <>`_
  of neutral/co-existing agents.

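The difference between shared and separate policies comes down to how agent IDs are mapped to policies. The following is a minimal plain-Python sketch, assuming hypothetical agent IDs ``agent_0`` to ``agent_2`` and illustrative function names. It does not use RLlib's own multi-agent configuration, which is described in the linked documentation.

```python
agent_ids = ["agent_0", "agent_1", "agent_2"]  # hypothetical agent IDs


def shared_mapping(agent_id):
    # Parameter sharing: every agent is controlled by the same policy.
    return "shared_policy"


def separate_mapping(agent_id):
    # Separate policies: each agent gets its own policy.
    return f"policy_for_{agent_id}"


def policies_needed(mapping_fn):
    """Returns the sorted set of policy IDs a mapping function requires."""
    return sorted({mapping_fn(agent_id) for agent_id in agent_ids})


print(policies_needed(shared_mapping))
print(policies_needed(separate_mapping))
```

A shared mapping trains one set of parameters on the experience of all agents, while a separate mapping trains one set per agent.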
**External simulators**: Don't have your simulation running as a ``gym.Env`` in Python?
No problem! RLlib supports an external environment API and comes with a pluggable,
off-the-shelf
`client <>`_/
`server <>`_
setup that allows you to run hundreds of independent simulators on the "outside"
(e.g. a Windows cloud) connecting to a central RLlib Policy-Server that learns
and serves actions. Alternatively, actions can be computed on the client side
to save on network traffic.

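The trade-off between server-side and client-side action computation can be illustrated with an in-process stand-in that only counts round trips. Nothing here touches the network or RLlib's actual client and server classes; ``PolicyServer``, ``run_client``, and their methods are hypothetical names chosen for this sketch.

```python
class PolicyServer:
    """Stand-in for a central server that owns the policy and collects experience."""

    def __init__(self):
        self.weights = 1.0       # toy "policy": action = weights * obs
        self.requests = 0        # number of round trips from clients
        self.experiences = []    # (obs, action, reward) tuples for learning

    def get_action(self, obs):
        self.requests += 1
        return self.weights * obs

    def get_weights(self):
        self.requests += 1
        return self.weights

    def log_batch(self, batch):
        self.requests += 1
        self.experiences.extend(batch)


def run_client(server, observations, local_inference):
    """Simulates one external simulator episode talking to the server."""
    batch = []
    weights = server.get_weights() if local_inference else None
    for obs in observations:
        if local_inference:
            action = weights * obs           # computed on the client side
        else:
            action = server.get_action(obs)  # one round trip per step
        reward = -abs(obs - action)
        batch.append((obs, action, reward))
    server.log_batch(batch)


observations = [0.5, -0.25, 0.75, 0.1]

remote = PolicyServer()
run_client(remote, observations, local_inference=False)

local = PolicyServer()
run_client(local, observations, local_inference=True)

print(f"server-side inference: {remote.requests} requests")
print(f"client-side inference: {local.requests} requests")
```

In this toy setup, server-side inference costs one request per step plus one to upload the episode, whereas client-side inference needs only a weight download and the upload, at the price of acting with weights that may be slightly stale.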
**Offline RL and imitation learning/behavior cloning**: You don't have a simulator
for your particular problem, but tons of historic data recorded by a legacy (maybe
non-RL/ML) system? This branch of reinforcement learning is for you!
RLlib comes with several `offline RL <>`_
algorithms (*CQL*, *MARWIL*, and *DQfD*), allowing you to either purely
`behavior-clone <>`_
your existing system or learn how to further improve over it.

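At its core, behavior cloning is supervised learning on recorded (observation, action) pairs. The sketch below fits a linear policy to data logged from a hypothetical legacy controller using ordinary least squares. It illustrates the idea only and is not RLlib's BC algorithm; the controller, its coefficients, and the noise level are assumptions made for the example.

```python
import random

rng = random.Random(0)


def legacy_controller(obs):
    # Hypothetical legacy system whose behavior we want to imitate.
    return 0.8 * obs + 0.1


# "Historic" dataset of (observation, action) pairs, with logging noise.
dataset = []
for _ in range(500):
    obs = rng.uniform(-5.0, 5.0)
    dataset.append((obs, legacy_controller(obs) + rng.gauss(0.0, 0.05)))

# Behavior cloning as supervised regression: fit action = w * obs + b
# by ordinary least squares (closed form for one input dimension).
n = len(dataset)
mean_obs = sum(o for o, _ in dataset) / n
mean_act = sum(a for _, a in dataset) / n
cov = sum((o - mean_obs) * (a - mean_act) for o, a in dataset)
var = sum((o - mean_obs) ** 2 for o, _ in dataset)
w = cov / var
b = mean_act - w * mean_obs

print(f"cloned policy: action = {w:.2f} * obs + {b:.2f}")
```

The fitted coefficients should land close to the controller's 0.8 and 0.1. Pure cloning can at best match the legacy system, which is why offline RL methods that also use the recorded rewards are needed to improve on it.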

In-Depth Documentation
----------------------

For an in-depth overview of RLlib and everything it has to offer, including
hands-on tutorials of important industry use cases and workflows, head over to
our `documentation pages <>`_.


Cite our Paper
--------------

If you've found RLlib useful for your research, please cite our `paper <>`_ as follows:
