
.. include:: /_includes/rllib/announcement.rst

.. include:: /_includes/rllib/we_are_hiring.rst

.. _rllib-environments-doc:

Environments
============

RLlib works with several different types of environments, including `Farama-Foundation Gymnasium <https://gymnasium.farama.org/>`__, user-defined, multi-agent, and also batched environments.

.. tip::

    Not all environments work with all algorithms. Check out the `algorithm overview <rllib-algorithms.html#available-algorithms-overview>`__ for more information.

.. image:: images/rllib-envs.svg

.. _configuring-environments:

Configuring Environments
------------------------

You can pass either a string name or a Python class to specify an environment. By default, strings will be interpreted as a gym `environment name <https://www.gymlibrary.dev/>`__.
Custom env classes passed directly to the algorithm must take a single ``env_config`` parameter in their constructor:


.. code-block:: python

    import gymnasium as gym

    import ray
    from ray.rllib.algorithms import ppo


    class MyEnv(gym.Env):
        def __init__(self, env_config):
            self.action_space = <gym.Space>
            self.observation_space = <gym.Space>

        def reset(self, *, seed=None, options=None):
            return <obs>, <info>

        def step(self, action):
            return <obs>, <reward: float>, <terminated: bool>, <truncated: bool>, <info: dict>


    ray.init()
    algo = ppo.PPO(env=MyEnv, config={
        "env_config": {},  # config to pass to env class
    })

    while True:
        print(algo.train())


You can also register a custom env creator function with a string name. This function must take a single ``env_config`` (dict) parameter and return an env instance:

.. code-block:: python

    from ray.tune.registry import register_env

    def env_creator(env_config):
        return MyEnv(...)  # return an env instance

    register_env("my_env", env_creator)
    algo = ppo.PPO(env="my_env")


For a full runnable code example using the custom environment API, see `custom_env.py <https://github.com/ray-project/ray/blob/master/rllib/examples/custom_env.py>`__.

.. warning::

    The gymnasium registry is not compatible with Ray. Instead, always use the registration flows documented above to ensure Ray workers can access the environment.

In the above example, note that the ``env_creator`` function takes in an ``env_config`` object.
This is a dict containing options passed in through your algorithm.
You can also access ``env_config.worker_index`` and ``env_config.vector_index`` to get the worker id and env id within the worker (if ``num_envs_per_worker > 0``).
This can be useful if you want to train over an ensemble of different environments, for example:


.. code-block:: python

    class MultiEnv(gym.Env):
        def __init__(self, env_config):
            # pick actual env based on worker and env indexes
            self.env = gym.make(
                choose_env_for(env_config.worker_index, env_config.vector_index))
            self.action_space = self.env.action_space
            self.observation_space = self.env.observation_space

        def reset(self, *, seed=None, options=None):
            return self.env.reset(seed=seed, options=options)

        def step(self, action):
            return self.env.step(action)

    register_env("multienv", lambda config: MultiEnv(config))


.. tip::

    When using logging in an environment, the logging configuration needs to be done inside the environment, which runs inside Ray workers. Any configuration done outside the environment, e.g., before starting Ray, will be ignored.

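
For example, here is a minimal sketch (using only Python's standard ``logging`` module; the env itself is a dummy) of configuring logging inside the env constructor so that the setup runs inside each Ray worker process:

.. code-block:: python

    import logging

    import gymnasium as gym


    class LoggingEnv(gym.Env):
        def __init__(self, env_config):
            # Configure logging here, inside the worker process that runs the env.
            logging.basicConfig(level=env_config.get("log_level", logging.INFO))
            self.logger = logging.getLogger(__name__)
            self.action_space = gym.spaces.Discrete(2)
            self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(4,))

        def reset(self, *, seed=None, options=None):
            self.logger.info("Episode reset")
            return self.observation_space.sample(), {}

        def step(self, action):
            self.logger.debug("Step with action %s", action)
            return self.observation_space.sample(), 0.0, True, False, {}
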

Gymnasium
----------

RLlib uses Gymnasium as its environment interface for single-agent training. For more information on how to implement a custom Gymnasium environment, see the `gymnasium.Env class definition <https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/core.py>`__. You may find the `SimpleCorridor <https://github.com/ray-project/ray/blob/master/rllib/examples/custom_env.py>`__ example useful as a reference.

Performance
~~~~~~~~~~~

.. tip::

    Also check out the `scaling guide <rllib-training.html#scaling-guide>`__ for RLlib training.


There are two ways to scale experience collection with Gym environments:

1. **Vectorization within a single process:** Though many envs can achieve high frame rates per core, their throughput is limited in practice by policy evaluation between steps. For example, even small TensorFlow models incur a couple of milliseconds of latency to evaluate. This can be worked around by creating multiple envs per process and batching policy evaluations across these envs.

   You can configure ``{"num_envs_per_worker": M}`` to have RLlib create ``M`` concurrent environments per worker. RLlib auto-vectorizes Gym environments via `VectorEnv.wrap() <https://github.com/ray-project/ray/blob/master/rllib/env/vector_env.py>`__.

2. **Distribute across multiple processes:** You can also have RLlib create multiple processes (Ray actors) for experience collection. In most algorithms this can be controlled by setting the ``{"num_workers": N}`` config.

.. image:: images/throughput.png

You can also combine vectorization and distributed execution, as shown in the above figure. Here we plot just the throughput of RLlib policy evaluation from 1 to 128 CPUs. PongNoFrameskip-v4 on GPU scales from 2.4k to ∼200k actions/s, and Pendulum-v1 on CPU from 15k to 1.5M actions/s. One machine was used for 1-16 workers, and a Ray cluster of four machines for 32-128 workers. Each worker was configured with ``num_envs_per_worker=64``.

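
For example, a minimal sketch (the specific values are illustrative) that combines both scaling knobs in a plain config dict, in the same style as the examples above:

.. code-block:: python

    from ray.rllib.algorithms import ppo

    # Combine in-process vectorization with distributed rollout workers.
    algo = ppo.PPO(env="CartPole-v1", config={
        "num_workers": 4,          # 4 rollout worker processes (Ray actors)
        "num_envs_per_worker": 8,  # each worker steps 8 env copies and batches inference
    })
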

Expensive Environments
~~~~~~~~~~~~~~~~~~~~~~

Some environments may be very resource-intensive to create. RLlib will create ``num_workers + 1`` copies of the environment since one copy is needed for the driver process. To avoid paying the extra overhead of the driver copy, which is needed to access the env's action and observation spaces, you can defer environment initialization until ``reset()`` is called.

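
One way to do this is to declare only the spaces in the constructor and build the actual environment lazily. The sketch below uses ``gym.make()`` as a stand-in for an expensive construction step:

.. code-block:: python

    import gymnasium as gym


    class DeferredInitEnv(gym.Env):
        """Builds the (potentially expensive) underlying env only on first reset()."""

        def __init__(self, env_config):
            # Only declare the spaces here (they must match the underlying env);
            # don't construct the expensive env yet, so the driver copy stays cheap.
            self.action_space = gym.spaces.Discrete(2)
            self.observation_space = gym.spaces.Box(
                float("-inf"), float("inf"), shape=(4,))
            self.env_config = env_config
            self.env = None

        def reset(self, *, seed=None, options=None):
            if self.env is None:
                # gym.make() stands in for your expensive construction step.
                self.env = gym.make(self.env_config.get("name", "CartPole-v1"))
            return self.env.reset(seed=seed, options=options)

        def step(self, action):
            return self.env.step(action)
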

Vectorized
----------

RLlib will auto-vectorize Gym envs for batch evaluation if the ``num_envs_per_worker`` config is set, or you can define a custom environment class that subclasses `VectorEnv <https://github.com/ray-project/ray/blob/master/rllib/env/vector_env.py>`__ to implement ``vector_step()`` and ``vector_reset()``.

Note that auto-vectorization only applies to policy inference by default. This means that policy inference will be batched, but your envs will still be stepped one at a time. If you would like your envs to be stepped in parallel, you can set ``"remote_worker_envs": True``. This will create env instances in Ray actors and step them in parallel. These remote processes introduce communication overheads, so this only helps if your env is very expensive to step / reset.

When using remote envs, you can control the batching level for inference with ``remote_env_batch_wait_ms``. The default value of 0ms means envs execute asynchronously and inference is only batched opportunistically. Setting the timeout to a large value will result in fully batched inference and effectively synchronous environment stepping. The optimal value depends on your environment's step / reset time and model inference speed.

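
For example, a sketch of these settings in the config-dict style used above (the specific numbers are illustrative):

.. code-block:: python

    config = {
        "num_envs_per_worker": 8,
        # Step each of the 8 env copies in its own Ray actor, in parallel.
        "remote_worker_envs": True,
        # Wait up to 10ms to accumulate observations before running inference.
        # 0 (the default) means fully asynchronous, opportunistic batching.
        "remote_env_batch_wait_ms": 10,
    }
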

Multi-Agent and Hierarchical
----------------------------

In a multi-agent environment, more than one "agent" acts in the same environment, either simultaneously, in a turn-based fashion, or in a combination of these two.

For example, in a traffic simulation, there may be multiple "car" and "traffic light" agents in the environment,
acting simultaneously. Whereas in a board game, you may have two or more agents acting in a turn-based fashion.

The mental model for multi-agent in RLlib is as follows:

(1) Your environment (a sub-class of :py:class:`~ray.rllib.env.multi_agent_env.MultiAgentEnv`) returns dictionaries mapping agent IDs (e.g. strings; the env can choose these arbitrarily) to individual agents' observations, rewards, and done-flags.
(2) You define (some of) the policies that are available up front (you can also add new policies on-the-fly throughout training), and
(3) You define a function that maps an env-produced agent ID to any available policy ID, which is then used to compute actions for this particular agent.

This is summarized by the below figure:

.. image:: images/multi-agent.svg

When implementing your own :py:class:`~ray.rllib.env.multi_agent_env.MultiAgentEnv`, note that you should only return those
agent IDs in an observation dict for which you expect to receive actions in the next call to ``step()``.

This API allows you to implement any type of multi-agent environment, from `turn-based games <https://github.com/ray-project/ray/blob/master/rllib/examples/self_play_with_open_spiel.py>`__
over environments in which `all agents always act simultaneously <https://github.com/ray-project/ray/blob/master/rllib/examples/env/multi_agent.py>`__, to anything in between.

Here is an example of an env in which all agents always step simultaneously:


.. code-block:: python

    # Env, in which all agents (whose IDs are entirely determined by the env
    # itself via the returned multi-agent obs/reward/dones-dicts) step
    # simultaneously.
    env = MultiAgentTrafficEnv(num_cars=2, num_traffic_lights=1)

    # Observations are a dict mapping agent names to their obs. Only those
    # agents' names that require actions in the next call to `step()` should
    # be present in the returned observation dict (here: all, as we always step
    # simultaneously).
    print(env.reset())
    # ... {
    # ...   "car_1": [[...]],
    # ...   "car_2": [[...]],
    # ...   "traffic_light_1": [[...]],
    # ... }

    # In the following call to `step`, actions should be provided for each
    # agent that returned an observation before:
    new_obs, rewards, dones, infos = env.step(
        actions={"car_1": ..., "car_2": ..., "traffic_light_1": ...})

    # Similarly, new_obs, rewards, dones, etc. also become dicts.
    print(rewards)
    # ... {"car_1": 3, "car_2": -1, "traffic_light_1": 0}

    # Individual agents can early exit; the entire episode is done when
    # dones["__all__"] = True.
    print(dones)
    # ... {"car_2": True, "__all__": False}


And another example, where agents step one after the other (turn-based game):

.. code-block:: python

    # Env, in which two agents step in sequence (turn-based game).
    # The env is in charge of the produced agent ID. Our env here produces
    # agent IDs: "player1" and "player2".
    env = TicTacToe()

    # Observations are a dict mapping agent names to their obs. Only those
    # agents' names that require actions in the next call to `step()` should
    # be present in the returned observation dict (here: one agent at a time).
    print(env.reset())
    # ... {
    # ...   "player1": [[...]],
    # ... }

    # In the following call to `step`, only those agents' actions should be
    # provided that were present in the returned obs dict:
    new_obs, rewards, dones, infos = env.step(actions={"player1": ...})

    # Similarly, new_obs, rewards, dones, etc. also become dicts.
    # Note that only in the `rewards` dict, any agent may be listed (even those that have
    # not(!) acted in the `step()` call). Rewards for individual agents will be added
    # up to the point where a new action for that agent is needed. This way, you may
    # implement a turn-based 2-player game, in which player-2's reward is published
    # in the `rewards` dict immediately after player-1 has acted.
    print(rewards)
    # ... {"player1": 0, "player2": 0}

    # Individual agents can early exit; the entire episode is done when
    # dones["__all__"] = True.
    print(dones)
    # ... {"player1": False, "__all__": False}

    # In the next step, it's player2's turn. Therefore, `new_obs` only contains
    # this agent's ID:
    print(new_obs)
    # ... {
    # ...   "player2": [[...]]
    # ... }


If all the agents will be using the same algorithm class to train, then you can set up multi-agent training as follows:

.. code-block:: python

    algo = pg.PGAgent(env="my_multiagent_env", config={
        "multiagent": {
            "policies": {
                # Use the PolicySpec namedtuple to specify an individual policy:
                "car1": PolicySpec(
                    policy_class=None,  # infer automatically from Algorithm
                    observation_space=None,  # infer automatically from env
                    action_space=None,  # infer automatically from env
                    config={"gamma": 0.85},  # use main config plus <- this override here
                ),  # alternatively, simply do: `PolicySpec(config={"gamma": 0.85})`

                # Deprecated way: Tuple specifying class, obs-/action-spaces,
                # config-overrides for each policy as a tuple.
                # If class is None -> Uses Algorithm's default policy class.
                "car2": (None, car_obs_space, car_act_space, {"gamma": 0.99}),

                # New way: Use PolicySpec() with keywords: `policy_class`,
                # `observation_space`, `action_space`, `config`.
                "traffic_light": PolicySpec(
                    observation_space=tl_obs_space,  # special obs space for lights?
                    action_space=tl_act_space,  # special action space for lights?
                ),
            },
            "policy_mapping_fn":
                lambda agent_id, episode, worker, **kwargs:
                    "traffic_light"  # Traffic lights are always controlled by this policy
                    if agent_id.startswith("traffic_light_")
                    else random.choice(["car1", "car2"])  # Randomly choose from car policies
        },
    })

    while True:
        print(algo.train())


To exclude some policies in your ``multiagent.policies`` dictionary from being trained, you can use the ``multiagent.policies_to_train`` setting.
For example, you may want to have one or more random (non-learning) policies interact with your learning ones:

.. code-block:: python

    # Example for a mapping function that maps agent IDs "player1" and "player2" to either
    # "random_policy" or "learning_policy", making sure that in each episode, both policies
    # are always playing each other.
    def policy_mapping_fn(agent_id, episode, worker, **kwargs):
        agent_idx = int(agent_id[-1]) - 1  # 0 (player1) or 1 (player2)
        # agent_id = "player[1|2]" -> policy depends on episode ID
        # This way, we make sure that both policies sometimes play player1
        # (start player) and sometimes player2 (player to move 2nd).
        return "learning_policy" if episode.episode_id % 2 == agent_idx else "random_policy"

    algo = pg.PGAgent(env="two_player_game", config={
        "multiagent": {
            "policies": {
                "learning_policy": PolicySpec(),  # <- use default class & infer obs-/act-spaces from env.
                "random_policy": PolicySpec(policy_class=RandomPolicy),  # infer obs-/act-spaces from env.
            },
            # Example for a mapping function that maps agent IDs "player1" and "player2" to either
            # "random_policy" or "learning_policy", making sure that in each episode, both policies
            # are always playing each other.
            "policy_mapping_fn": policy_mapping_fn,
            # Specify a (fixed) list (or set) of policy IDs that should be updated.
            "policies_to_train": ["learning_policy"],
            # Alternatively, you can provide a callable that returns True or False, when provided
            # with a policy ID and an (optional) SampleBatch:
            # "policies_to_train": lambda pid, batch: ... (<- return True or False)
            # This allows you to more flexibly update (or not) policies, based on
            # who they played with in the episode (or other information that can be
            # found in the given batch, e.g. rewards).
        },
    })


RLlib will create three distinct policies and route agent decisions to their bound policies using the given ``policy_mapping_fn``.
When an agent first appears in the env, ``policy_mapping_fn`` will be called to determine which policy it is bound to.
RLlib reports separate training statistics for each policy in the return from ``train()``, along with the combined reward.

Here is a simple `example training script <https://github.com/ray-project/ray/blob/master/rllib/examples/multi_agent_cartpole.py>`__
in which you can vary the number of agents and policies in the environment.
For how to use multiple training methods at once (here DQN and PPO),
see the `two-trainer example <https://github.com/ray-project/ray/blob/master/rllib/examples/multi_agent_two_trainers.py>`__.
Metrics are reported for each policy separately, for example:


.. code-block:: bash
   :emphasize-lines: 6,14,22

   Result for PPO_multi_cartpole_0:
     episode_len_mean: 34.025862068965516
     episode_reward_max: 159.0
     episode_reward_mean: 86.06896551724138
     info:
       policy_0:
         cur_lr: 4.999999873689376e-05
         entropy: 0.6833480000495911
         kl: 0.010264254175126553
         policy_loss: -11.95590591430664
         total_loss: 197.7039794921875
         vf_explained_var: 0.0010995268821716309
         vf_loss: 209.6578826904297
       policy_1:
         cur_lr: 4.999999873689376e-05
         entropy: 0.6827034950256348
         kl: 0.01119876280426979
         policy_loss: -8.787769317626953
         total_loss: 88.26161193847656
         vf_explained_var: 0.0005457401275634766
         vf_loss: 97.0471420288086
     policy_reward_mean:
       policy_0: 21.194444444444443
       policy_1: 21.798387096774192


To scale to hundreds of agents (if these agents are using the same policy), MultiAgentEnv batches policy evaluations across multiple agents internally.
Your ``MultiAgentEnvs`` are also auto-vectorized (just like normal, single-agent envs, e.g. ``gym.Env``) by setting ``num_envs_per_worker > 1``.


PettingZoo Multi-Agent Environments
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`PettingZoo <https://github.com/Farama-Foundation/PettingZoo>`__ is a repository of over 50 diverse multi-agent environments. Its API is not directly compatible with RLlib, but it can be converted into an RLlib ``MultiAgentEnv`` as in this example:

.. code-block:: python

    from ray.tune.registry import register_env
    # import the pettingzoo environment
    from pettingzoo.butterfly import prison_v3
    # import rllib pettingzoo interface
    from ray.rllib.env import PettingZooEnv

    # define how to make the environment. This way takes an optional environment config, num_floors
    env_creator = lambda config: prison_v3.env(num_floors=config.get("num_floors", 4))

    # register that way to make the environment under an rllib name
    register_env('prison', lambda config: PettingZooEnv(env_creator(config)))

    # now you can use `prison` as an environment
    # you can pass arguments to the environment creator with the env_config option in the config
    config['env_config'] = {"num_floors": 5}

A more complete example is here: `rllib_pistonball.py <https://github.com/Farama-Foundation/PettingZoo/blob/master/tutorials/Ray/rllib_pistonball.py>`__


Rock Paper Scissors Example
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The `rock_paper_scissors_multiagent.py <https://github.com/ray-project/ray/blob/master/rllib/examples/rock_paper_scissors_multiagent.py>`__ example demonstrates several types of policies competing against each other: heuristic policies of repeating the same move, beating the last opponent move, and learned LSTM and feedforward policies.

.. figure:: images/rock-paper-scissors.png

   TensorBoard output of running the rock-paper-scissors example, where a learned policy faces off against a random selection of the same-move and beat-last-move heuristics. Here the performance of heuristic policies vs the learned policy is compared with LSTM enabled (blue) and a plain feed-forward policy (red). While the feedforward policy can easily beat the same-move heuristic by simply avoiding the last move taken, it takes an LSTM policy to distinguish between and consistently beat both heuristics.


Variable-Sharing Between Policies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. note::

    With `ModelV2 <rllib-models.html#tensorflow-models>`__, you can put layers in global variables and straightforwardly share those layer objects between models, instead of using variable scopes.

RLlib will create each policy's model in a separate ``tf.variable_scope``. However, variables can still be shared between policies by explicitly entering a globally shared variable scope with ``tf.VariableScope(reuse=tf.AUTO_REUSE)``:

.. code-block:: python

    with tf.variable_scope(
            tf.VariableScope(tf.AUTO_REUSE, "name_of_global_shared_scope"),
            reuse=tf.AUTO_REUSE,
            auxiliary_name_scope=False):
        <create the shared layers here>

There is a full example of this in the `example training script <https://github.com/ray-project/ray/blob/master/rllib/examples/multi_agent_cartpole.py>`__.


Implementing a Centralized Critic
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here are two ways to implement a centralized critic compatible with the multi-agent API:

**Strategy 1: Sharing experiences in the trajectory preprocessor**:

The most general way of implementing a centralized critic involves defining the ``postprocess_fn`` method of a custom policy. ``postprocess_fn`` is called by ``Policy.postprocess_trajectory``, which has full access to the policies and observations of concurrent agents via the ``other_agent_batches`` and ``episode`` arguments. The batch of critic predictions can then be added to the postprocessed trajectory. Here's an example:

.. code-block:: python

    def postprocess_fn(policy, sample_batch, other_agent_batches, episode):
        agents = ["agent_1", "agent_2", "agent_3"]  # simple example of 3 agents
        global_obs_batch = np.stack(
            [other_agent_batches[agent_id][1]["obs"] for agent_id in agents],
            axis=1)
        # add the global obs and global critic value
        sample_batch["global_obs"] = global_obs_batch
        sample_batch["central_vf"] = self.sess.run(
            self.critic_network, feed_dict={"obs": global_obs_batch})
        return sample_batch

To update the critic, you'll also have to modify the loss of the policy. For an end-to-end runnable example, see `examples/centralized_critic.py <https://github.com/ray-project/ray/blob/master/rllib/examples/centralized_critic.py>`__.

**Strategy 2: Sharing observations through an observation function**:

Alternatively, you can use an observation function to share observations between agents. In this strategy, each observation includes all global state, and policies use a custom model to ignore state they aren't supposed to "see" when computing actions. The advantage of this approach is that it's very simple and you don't have to change the algorithm at all -- just use the observation func (i.e., like an env wrapper) and custom model. However, it is a bit less principled in that you have to change the agent observation spaces to include training-time only information. You can find a runnable example of this strategy at `examples/centralized_critic_2.py <https://github.com/ray-project/ray/blob/master/rllib/examples/centralized_critic_2.py>`__.

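
A minimal sketch of the observation-sharing idea (the wrapper and the ``own_obs`` / ``global_obs`` key names are illustrative, not RLlib APIs; it follows the multi-agent step/reset convention shown in the examples above):

.. code-block:: python

    import numpy as np
    from ray.rllib.env.multi_agent_env import MultiAgentEnv


    class GlobalObsWrapper(MultiAgentEnv):
        """Augments each agent's obs with a stacked view of all agents' obs."""

        def __init__(self, env):
            super().__init__()
            self.env = env
            # NOTE: observation_space must also be adjusted to a Dict space
            # matching the augmented observations (omitted here for brevity).

        def reset(self):
            return self._augment(self.env.reset())

        def step(self, action_dict):
            obs, rewards, dones, infos = self.env.step(action_dict)
            return self._augment(obs), rewards, dones, infos

        def _augment(self, obs):
            # Each agent sees its own obs plus the "global" state of all agents.
            global_obs = np.stack([np.asarray(obs[aid]) for aid in sorted(obs)])
            return {
                aid: {"own_obs": o, "global_obs": global_obs}
                for aid, o in obs.items()
            }
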

Grouping Agents
~~~~~~~~~~~~~~~

It is common to have groups of agents in multi-agent RL. RLlib treats agent groups like a single agent with a Tuple action and observation space. The grouped agent can then be assigned to a single policy for centralized execution, or to specialized multi-agent policies such as :ref:`Q-Mix <qmix>` that implement centralized training but decentralized execution. You can use the ``MultiAgentEnv.with_agent_groups()`` method to define these groups:

.. literalinclude:: ../../../rllib/env/multi_agent_env.py
   :language: python
   :start-after: __grouping_doc_begin__
   :end-before: __grouping_doc_end__

For environments with multiple groups, or mixtures of agent groups and individual agents, you can use grouping in conjunction with the policy mapping API described in prior sections.

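
For instance, a sketch of registering a grouped version of a two-agent env (``MyTwoAgentEnv``, the agent IDs, and the per-agent spaces are hypothetical placeholders for your own env):

.. code-block:: python

    from gymnasium.spaces import Tuple
    from ray.tune.registry import register_env


    def grouped_env_creator(env_config):
        # MyTwoAgentEnv is a placeholder for your own MultiAgentEnv with
        # agent IDs "agent_1" and "agent_2" and identical per-agent spaces.
        env = MyTwoAgentEnv(env_config)
        # Treat both agents as a single "group_1" agent with Tuple spaces.
        return env.with_agent_groups(
            groups={"group_1": ["agent_1", "agent_2"]},
            obs_space=Tuple([env.observation_space, env.observation_space]),
            act_space=Tuple([env.action_space, env.action_space]),
        )

    register_env("grouped_two_agent_env", grouped_env_creator)
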

Hierarchical Environments
~~~~~~~~~~~~~~~~~~~~~~~~~

Hierarchical training can sometimes be implemented as a special case of multi-agent RL. For example, consider a three-level hierarchy of policies, where a top-level policy issues high level actions that are executed at finer timescales by a mid-level and a low-level policy. The following timeline shows one step of the top-level policy, which corresponds to two mid-level actions and five low-level actions:

.. code-block:: text

    top_level ---------------------------------------------------------------> top_level --->
    mid_level_0 -------------------------------> mid_level_0 ----------------> mid_level_1 ->
    low_level_0 -> low_level_0 -> low_level_0 -> low_level_1 -> low_level_1 -> low_level_2 ->

This can be implemented as a multi-agent environment with three types of agents. Each higher-level action creates a new lower-level agent instance with a new id (e.g., ``low_level_0``, ``low_level_1``, ``low_level_2`` in the above example). These lower-level agents pop into existence at the start of higher-level steps, and terminate when their higher-level action ends. Their experiences are aggregated by policy, so from RLlib's perspective it's just optimizing three different types of policies. The configuration might look something like this:

.. code-block:: python

    "multiagent": {
        "policies": {
            "top_level": (custom_policy or None, ...),
            "mid_level": (custom_policy or None, ...),
            "low_level": (custom_policy or None, ...),
        },
        "policy_mapping_fn":
            lambda agent_id:
                "low_level" if agent_id.startswith("low_level_") else
                "mid_level" if agent_id.startswith("mid_level_") else "top_level",
        "policies_to_train": ["top_level"],
    },

In this setup, the appropriate rewards for training lower-level agents must be provided by the multi-agent env implementation.
The environment class is also responsible for routing between the agents, e.g., conveying `goals <https://arxiv.org/pdf/1703.01161.pdf>`__ from higher-level
agents to lower-level agents as part of the lower-level agent observation.

See this file for a runnable example: `hierarchical_training.py <https://github.com/ray-project/ray/blob/master/rllib/examples/hierarchical_training.py>`__.


External Agents and Applications
--------------------------------

In many situations, it does not make sense for an environment to be "stepped" by RLlib. For example, if a policy is to be used in a web serving system, then it is more natural for an agent to query a service that serves policy decisions, and for that service to learn from experience over time. This case also naturally arises with **external simulators** (e.g. Unity3D, other game engines, or the Gazebo robotics simulator) that run independently outside the control of RLlib, but may still want to leverage RLlib for training.

.. figure:: images/rllib-training-inside-a-unity3d-env.png
   :scale: 75 %

   A Unity3D soccer game being learnt by RLlib via the ExternalEnv API.

RLlib provides the `ExternalEnv <https://github.com/ray-project/ray/blob/master/rllib/env/external_env.py>`__ class for this purpose.
Unlike other envs, ExternalEnv has its own thread of control. At any point, agents on that thread can query the current policy for decisions via ``self.get_action()`` and report rewards, done-dicts, and infos via ``self.log_returns()``.
This can be done for multiple concurrent episodes as well.

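
As a rough sketch of what an ``ExternalEnv`` subclass can look like (``my_external_simulator`` is a hypothetical stand-in for your own simulator or serving loop; the spaces are illustrative):

.. code-block:: python

    import gymnasium as gym
    from ray.rllib.env.external_env import ExternalEnv


    class MyServingEnv(ExternalEnv):
        def __init__(self, env_config):
            super().__init__(
                action_space=gym.spaces.Discrete(2),
                observation_space=gym.spaces.Box(-1.0, 1.0, shape=(4,)),
            )

        def run(self):
            # This loop runs in its own thread and drives the external system.
            while True:
                episode_id = self.start_episode()
                obs = my_external_simulator.reset()  # hypothetical external system
                done = False
                while not done:
                    action = self.get_action(episode_id, obs)
                    obs, reward, done = my_external_simulator.step(action)
                    self.log_returns(episode_id, reward)
                self.end_episode(episode_id, obs)
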

Take a look at the examples here for a `simple "CartPole-v1" server <https://github.com/ray-project/ray/blob/master/rllib/examples/serving/cartpole_server.py>`__
and `n client(s) <https://github.com/ray-project/ray/blob/master/rllib/examples/serving/cartpole_client.py>`__
scripts, in which we set up an RLlib policy server that listens on one or more ports for client connections
and connect several clients to this server to learn the env.

Another `example <https://github.com/ray-project/ray/blob/master/rllib/examples/serving/unity3d_server.py>`__ shows
how to run a similar setup against a Unity3D external game engine.


Logging off-policy actions
~~~~~~~~~~~~~~~~~~~~~~~~~~

ExternalEnv provides a ``self.log_action()`` call to support off-policy actions. This allows the client to make independent decisions, e.g., to compare two different policies, and for RLlib to still learn from those off-policy actions. Note that this requires the algorithm used to support learning from off-policy decisions (e.g., DQN).

.. seealso::

    `Offline Datasets <rllib-offline.html>`__ provide higher-level interfaces for working with off-policy experience datasets.


External Application Clients
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For applications that are running entirely outside the Ray cluster (i.e., cannot be packaged into a Python environment of any form), RLlib provides the ``PolicyServerInput`` application connector, which can be connected to over the network using ``PolicyClient`` instances.

You can configure any Algorithm to launch a policy server with the following config:

.. code-block:: python

    config = {
        # An environment class is still required, but it doesn't need to be runnable.
        # You only need to define its action and observation space attributes.
        # See examples/serving/unity3d_server.py for an example using a RandomMultiAgentEnv stub.
        "env": YOUR_ENV_STUB,

        # Use the policy server to generate experiences.
        "input": (
            lambda ioctx: PolicyServerInput(ioctx, SERVER_ADDRESS, SERVER_PORT)
        ),
        # Use the existing algorithm process to run the server.
        "num_workers": 0,
    }


Clients can then connect in either *local* or *remote* inference mode.
In local inference mode, copies of the policy are downloaded from the server and cached on the client for a configurable period of time.
This allows actions to be computed by the client without requiring a network round trip each time.
In remote inference mode, each computed action requires a network call to the server.

Example:

.. code-block:: python

    client = PolicyClient("http://localhost:9900", inference_mode="local")
    episode_id = client.start_episode()
    ...
    action = client.get_action(episode_id, cur_obs)
    ...
    client.end_episode(episode_id, last_obs)


To understand the difference between standard envs, external envs, and connecting with a ``PolicyClient``, refer to the following figure:

.. https://docs.google.com/drawings/d/1hJvT9bVGHVrGTbnCZK29BYQIcYNRbZ4Dr6FOPMJDjUs/edit

.. image:: images/rllib-external.svg

Try it yourself by launching a
`simple CartPole server <https://github.com/ray-project/ray/blob/master/rllib/examples/serving/cartpole_server.py>`__ (see below) and connecting any number of clients
(`cartpole_client.py <https://github.com/ray-project/ray/blob/master/rllib/examples/serving/cartpole_client.py>`__) to it, or
run a `Unity3D learning server <https://github.com/ray-project/ray/blob/master/rllib/examples/serving/unity3d_server.py>`__
against distributed Unity game engines in the cloud.


CartPole Example:

.. code-block:: bash

    # Start the server by running:
    >>> python rllib/examples/serving/cartpole_server.py --run=PPO
    --
    -- Starting policy server at localhost:9900
    --

    # To connect from a client with inference_mode="remote".
    >>> python rllib/examples/serving/cartpole_client.py --inference-mode=remote
    Total reward: 10.0
    Total reward: 58.0
    ...
    Total reward: 200.0
    ...

    # To connect from a client with inference_mode="local" (faster).
    >>> python rllib/examples/serving/cartpole_client.py --inference-mode=local
    Querying server for new policy weights.
    Generating new batch of experiences.
    Total reward: 13.0
    Total reward: 11.0
    ...
    Sending batch of 1000 steps back to server.
    Querying server for new policy weights.
    ...
    Total reward: 200.0
    ...

For the best performance, we recommend using ``inference_mode="local"`` when possible.


Advanced Integrations
---------------------

For more complex / high-performance environment integrations, you can instead extend the low-level `BaseEnv <https://github.com/ray-project/ray/blob/master/rllib/env/base_env.py>`__ class. This low-level API models multiple agents executing asynchronously in multiple environments. A call to ``BaseEnv.poll()`` returns observations from ready agents keyed by 1) their environment, then 2) agent ids. Actions for those agents are sent back via ``BaseEnv.send_actions()``. BaseEnv is used to implement all the other env types in RLlib, so it offers a superset of their functionality. For example, ``BaseEnv`` is used to implement dynamic batching of observations for inference over `multiple simulator actors <https://github.com/ray-project/ray/blob/master/rllib/env/remote_vector_env.py>`__.

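
To illustrate the nested keying convention only (this is a conceptual sketch of the data shapes, not a runnable BaseEnv integration; the env and agent IDs are made up):

.. code-block:: python

    # A poll()-style result, conceptually: values are keyed first by env ID,
    # then by agent ID within that env. Only "ready" agents appear.
    observations = {
        0: {"car_1": [0.1, 0.2], "traffic_light_1": [1.0]},  # env 0: two agents ready
        1: {"car_1": [0.3, 0.4]},                            # env 1: one agent ready
    }

    # The actions passed to send_actions() mirror the same nesting: one action
    # for every agent that appeared in the polled observations.
    actions = {
        0: {"car_1": 1, "traffic_light_1": 0},
        1: {"car_1": 2},
    }
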

.. include:: /_includes/rllib/announcement_bottom.rst