.. _rllib-advanced-api-doc:

Advanced Python APIs
--------------------

Custom Training Workflows
~~~~~~~~~~~~~~~~~~~~~~~~~

In the `basic training example <https://github.com/ray-project/ray/blob/master/rllib/examples/custom_env.py>`__,
Tune will call ``train()`` on your algorithm once per training iteration and report
the new training results.
Sometimes, it is desirable to have full control over training, but still run inside Tune.
Tune supports :ref:`custom trainable functions <trainable-docs>` that can be used to
implement `custom training workflows (example) <https://github.com/ray-project/ray/blob/master/rllib/examples/custom_train_fn.py>`__.

For even finer-grained control over training, you can use RLlib's lower-level
`building blocks <rllib-concepts.html>`__ directly to implement
`fully customized training workflows <https://github.com/ray-project/ray/blob/master/rllib/examples/rollout_worker_custom_workflow.py>`__.
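As a rough orientation, such a custom trainable function could look like the following
minimal sketch (PPO on CartPole; the number of iterations, the reported metric, and the
``lr`` search space are illustrative and not taken from the linked example):

.. code-block:: python

    from ray import tune
    from ray.air import session
    from ray.rllib.algorithms.ppo import PPOConfig

    def my_train_fn(config):
        # Build the Algorithm manually, so we have full control over the loop.
        algo = (
            PPOConfig()
            .environment("CartPole-v1")
            .training(lr=config["lr"])
            .build()
        )
        for _ in range(5):
            result = algo.train()
            # Hand intermediate results back to Tune after every iteration.
            session.report({"episode_reward_mean": result["episode_reward_mean"]})
        algo.stop()

    tune.Tuner(
        my_train_fn,
        param_space={"lr": tune.grid_search([1e-4, 1e-3])},
    ).fit()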
Curriculum Learning
~~~~~~~~~~~~~~~~~~~

In curriculum learning, the environment can be set to different difficulties
(or "tasks") to allow for learning to progress through controlled phases (from easy to
more difficult). RLlib comes with a basic curriculum learning API utilizing the
`TaskSettableEnv <https://github.com/ray-project/ray/blob/master/rllib/env/apis/task_settable_env.py>`__ environment API.
Your environment only needs to implement the `set_task` and `get_task` methods
for this to work. You can then define an `env_task_fn` in your config,
which receives the last training results and returns a new task for the env to be set to:

.. TODO move to doc_code and make it use algo configs.
.. code-block:: python

    from ray.rllib.env.apis.task_settable_env import TaskSettableEnv

    class MyEnv(TaskSettableEnv):
        def get_task(self):
            return self.current_difficulty

        def set_task(self, task):
            self.current_difficulty = task

    def curriculum_fn(train_results, task_settable_env, env_ctx):
        # Very simple curriculum function.
        current_task = task_settable_env.get_task()
        new_task = current_task + 1
        return new_task

    # Setup your Algorithm's config like so:
    config = {
        "env": MyEnv,
        "env_task_fn": curriculum_fn,
    }
    # Train using `Tuner.fit()` or `Algorithm.train()` and the above config stub.
    # ...
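Equivalently, with the config-object API, the curriculum function can be passed through
the ``env_task_fn`` argument of ``AlgorithmConfig.environment()``. A minimal sketch,
reusing ``MyEnv`` and ``curriculum_fn`` from above:

.. code-block:: python

    from ray.rllib.algorithms.ppo import PPOConfig

    config = (
        PPOConfig()
        # `MyEnv` and `curriculum_fn` as defined in the block above.
        .environment(env=MyEnv, env_task_fn=curriculum_fn)
    )
    algo = config.build()
    algo.train()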
There are two more ways to use RLlib's other APIs to implement
`curriculum learning <https://bair.berkeley.edu/blog/2017/12/20/reverse-curriculum/>`__.

Use the Algorithm API and update the environment between calls to ``train()``.
This example shows the algorithm being run inside a Tune function.
This is basically the same as what the built-in `env_task_fn` API described above
already does under the hood, but allows you to do even more customizations to your
training loop.

.. TODO move to doc_code and make it use algo configs.
.. code-block:: python

    import ray
    from ray import tune
    from ray.rllib.algorithms.ppo import PPO

    def train(config, reporter):
        algo = PPO(config=config, env=YourEnv)
        while True:
            result = algo.train()
            reporter(**result)
            if result["episode_reward_mean"] > 200:
                task = 2
            elif result["episode_reward_mean"] > 100:
                task = 1
            else:
                task = 0
            algo.workers.foreach_worker(
                lambda ev: ev.foreach_env(
                    lambda env: env.set_task(task)))

    num_gpus = 0
    num_workers = 2

    ray.init()
    tune.Tuner(
        tune.with_resources(
            train,
            resources=tune.PlacementGroupFactory(
                [{"CPU": 1}, {"GPU": num_gpus}] + [{"CPU": 1}] * num_workers
            ),
        ),
        param_space={
            "num_gpus": num_gpus,
            "num_workers": num_workers,
        },
    ).fit()
You could also use RLlib's callbacks API to update the environment on new training
results:

.. TODO move to doc_code and make it use algo configs.
.. code-block:: python

    import ray
    from ray import tune
    from ray.rllib.algorithms.callbacks import DefaultCallbacks

    class MyCallbacks(DefaultCallbacks):
        def on_train_result(self, algorithm, result, **kwargs):
            if result["episode_reward_mean"] > 200:
                task = 2
            elif result["episode_reward_mean"] > 100:
                task = 1
            else:
                task = 0
            algorithm.workers.foreach_worker(
                lambda ev: ev.foreach_env(
                    lambda env: env.set_task(task)))

    ray.init()
    tune.Tuner(
        "PPO",
        param_space={
            "env": YourEnv,
            "callbacks": MyCallbacks,
        },
    ).fit()
Global Coordination
~~~~~~~~~~~~~~~~~~~

Sometimes, it is necessary to coordinate between pieces of code that live in different
processes managed by RLlib.
For example, it can be useful to maintain a global average of a certain variable,
or centrally control a hyperparameter used by policies.
Ray provides a general way to achieve this through *named actors*
(learn more about :ref:`Ray actors here <actor-guide>`).
These actors are assigned a global name and handles to them can be retrieved using
these names. As an example, consider maintaining a shared global counter that is
incremented by environments and read periodically from your driver program:

.. literalinclude:: ./doc_code/advanced_api.py
    :language: python
    :start-after: __rllib-adv_api_counter_begin__
    :end-before: __rllib-adv_api_counter_end__

Ray actors provide high levels of performance, so in more complex cases they can be
used to implement communication patterns such as parameter servers and allreduce.
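For reference, the named-actor pattern used by the counter example above looks roughly
like the following minimal sketch (the ``Counter`` actor, its methods, and the
``"global_counter"`` name are illustrative, not necessarily the names used in the
linked file):

.. code-block:: python

    import ray

    @ray.remote
    class Counter:
        """A named actor holding a single, globally shared counter."""

        def __init__(self):
            self.count = 0

        def inc(self, n=1):
            self.count += n

        def get(self):
            return self.count

    ray.init()

    # Create the actor under a global name, so any other process can look it up.
    counter = Counter.options(name="global_counter").remote()

    # E.g. inside an environment running on a remote RolloutWorker:
    ray.get_actor("global_counter").inc.remote(1)

    # Periodically, from the driver program:
    print(ray.get(ray.get_actor("global_counter").get.remote()))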
Callbacks and Custom Metrics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can provide callbacks to be called at points during policy evaluation.
These callbacks have access to state for the current
`episode <https://github.com/ray-project/ray/blob/master/rllib/evaluation/episode.py>`__.
Certain callbacks such as ``on_postprocess_trajectory``, ``on_sample_end``,
and ``on_train_result`` are also places where custom postprocessing can be applied to
intermediate data or results.

User-defined state can be stored for the
`episode <https://github.com/ray-project/ray/blob/master/rllib/evaluation/episode.py>`__
in the ``episode.user_data`` dict, and custom scalar metrics reported by saving values
to the ``episode.custom_metrics`` dict. These custom metrics will be aggregated and
reported as part of training results. For a full example, take a look at
`this example script here <https://github.com/ray-project/ray/blob/master/rllib/examples/custom_metrics_and_callbacks.py>`__
and
`these unit test cases here <https://github.com/ray-project/ray/blob/master/rllib/algorithms/tests/test_callbacks.py>`__.
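For instance, a callback that collects a per-step value and reports a per-episode custom
metric could look like this minimal sketch (the metric name and the ``my_value`` info
key are illustrative):

.. code-block:: python

    from ray.rllib.algorithms.callbacks import DefaultCallbacks

    class CustomMetricCallbacks(DefaultCallbacks):
        def on_episode_start(self, *, worker, base_env, policies, episode, **kwargs):
            # Per-episode scratch space for intermediate values.
            episode.user_data["my_values"] = []

        def on_episode_step(self, *, worker, base_env, policies=None, episode, **kwargs):
            # Collect some per-step quantity, here a (hypothetical) value
            # from the env's last info dict, if present.
            info = episode.last_info_for()
            if info and "my_value" in info:
                episode.user_data["my_values"].append(info["my_value"])

        def on_episode_end(self, *, worker, base_env, policies, episode, **kwargs):
            # Report a scalar; RLlib aggregates it (mean/min/max) under
            # `custom_metrics` in the training results.
            values = episode.user_data["my_values"]
            episode.custom_metrics["my_value_mean"] = (
                sum(values) / len(values) if values else 0.0
            )

    # Use it via `config["callbacks"] = CustomMetricCallbacks` (or
    # `AlgorithmConfig().callbacks(CustomMetricCallbacks)`).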
.. tip::
    You can create custom logic that can run on each evaluation episode by checking
    if the :py:class:`~ray.rllib.evaluation.rollout_worker.RolloutWorker` is in
    evaluation mode, through accessing ``worker.policy_config["in_evaluation"]``.
    You can then implement this check in ``on_episode_start()`` or ``on_episode_end()``
    in your subclass of :py:class:`~ray.rllib.algorithms.callbacks.DefaultCallbacks`.
    For running callbacks before and after the whole evaluation step, we provide
    ``on_evaluate_start()`` and ``on_evaluate_end()``.
.. dropdown:: Click here to see the full API of the ``DefaultCallbacks`` class

    .. autoclass:: ray.rllib.algorithms.callbacks.DefaultCallbacks
        :members:
Chaining Callbacks
~~~~~~~~~~~~~~~~~~

Use the ``MultiCallbacks`` class to chain multiple callbacks together.

.. autoclass:: ray.rllib.algorithms.callbacks.MultiCallbacks
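For example, two (illustrative) ``DefaultCallbacks`` subclasses can be combined and
passed wherever a single callbacks class would go:

.. code-block:: python

    from ray.rllib.algorithms.callbacks import DefaultCallbacks, MultiCallbacks

    class EpisodeLenCallbacks(DefaultCallbacks):
        def on_episode_end(self, *, worker, base_env, policies, episode, **kwargs):
            episode.custom_metrics["episode_len"] = episode.length

    class PrintIterationCallbacks(DefaultCallbacks):
        def on_train_result(self, *, algorithm, result, **kwargs):
            print("training_iteration:", result["training_iteration"])

    config = {
        # Each sub-callback is triggered, in the given order, for every event.
        "callbacks": MultiCallbacks([EpisodeLenCallbacks, PrintIterationCallbacks]),
    }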
Visualizing Custom Metrics
~~~~~~~~~~~~~~~~~~~~~~~~~~

Custom metrics can be accessed and visualized like any other training result:

.. image:: images/custom_metric.png
.. _exploration-api:

Customizing Exploration Behavior
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

RLlib offers a unified top-level API to configure and customize an agent's
exploration behavior, including the decisions (how and whether) to sample
actions from distributions (stochastically or deterministically).
The setup can be done using built-in Exploration classes
(see `this package <https://github.com/ray-project/ray/blob/master/rllib/utils/exploration/>`__),
which are specified (and further configured) inside
``AlgorithmConfig().exploration(..)``.
Besides using one of the available classes, one can sub-class any of
these built-ins, add custom behavior to it, and use that new class in
the config instead.

Every policy has an Exploration object, which is created from the AlgorithmConfig's
``.exploration(exploration_config=...)`` method, which specifies the class to use via the
special "type" key, as well as constructor arguments via all other keys,
e.g.:

.. literalinclude:: ./doc_code/advanced_api.py
    :language: python
    :start-after: __rllib-adv_api_explore_begin__
    :end-before: __rllib-adv_api_explore_end__
The following table lists all built-in Exploration sub-classes and the agents
that currently use these by default:

.. View table below at: https://docs.google.com/drawings/d/1dEMhosbu7HVgHEwGBuMlEDyPiwjqp_g6bZ0DzCMaoUM/edit?usp=sharing
.. image:: images/rllib-exploration-api-table.svg

An Exploration class implements the ``get_exploration_action`` method,
in which the exact exploratory behavior is defined.
It takes the model's output, the action distribution class, the model itself,
a timestep (the global env-sampling steps already taken),
and an ``explore`` switch and outputs a tuple of a) action and
b) log-likelihood:

.. literalinclude:: ../../../rllib/utils/exploration/exploration.py
    :language: python
    :start-after: __sphinx_doc_begin_get_exploration_action__
    :end-before: __sphinx_doc_end_get_exploration_action__
On the highest level, the ``Algorithm.compute_actions`` and ``Policy.compute_actions``
methods have a boolean ``explore`` switch, which is passed into
``Exploration.get_exploration_action``. If ``explore=None``, the value of
``Algorithm.config["explore"]`` is used, which thus serves as a main switch for
exploratory behavior, allowing e.g. turning off any exploration easily for
evaluation purposes (see :ref:`CustomEvaluation`).
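For single observations, ``Algorithm.compute_single_action`` exposes the same
``explore`` switch. A minimal sketch (PPO on CartPole, with a sampled stand-in
observation):

.. code-block:: python

    from ray.rllib.algorithms.ppo import PPOConfig

    # Illustrative setup: a PPO Algorithm on CartPole.
    algo = PPOConfig().environment("CartPole-v1").build()

    # A stand-in observation (normally this would come from your env).
    obs = algo.get_policy().observation_space.sample()

    # An explicit `explore` arg overrides the config's "explore" default per call:
    a_stochastic = algo.compute_single_action(obs, explore=True)
    a_deterministic = algo.compute_single_action(obs, explore=False)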
The following are example excerpts from different Algorithms' configs
(see ``rllib/algorithms/algorithm.py``) to setup different exploration behaviors:

.. TODO move to doc_code and make it use algo configs.
.. code-block:: python

    # All of the following configs go into Algorithm.config.

    # 1) Switching *off* exploration by default.
    # Behavior: Calling `compute_action(s)` without explicitly setting its `explore`
    # param will result in no exploration.
    # However, explicitly calling `compute_action(s)` with `explore=True` will
    # still(!) result in exploration (per-call overrides default).
    "explore": False,

    # 2) Switching *on* exploration by default.
    # Behavior: Calling `compute_action(s)` without explicitly setting its
    # explore param will result in exploration.
    # However, explicitly calling `compute_action(s)` with `explore=False`
    # will result in no(!) exploration (per-call overrides default).
    "explore": True,

    # 3) Example exploration_config usages:
    # a) DQN: see rllib/algorithms/dqn/dqn.py
    "explore": True,
    "exploration_config": {
        # Exploration sub-class by name or full path to module+class
        # (e.g. "ray.rllib.utils.exploration.epsilon_greedy.EpsilonGreedy")
        "type": "EpsilonGreedy",
        # Parameters for the Exploration class' constructor:
        "initial_epsilon": 1.0,
        "final_epsilon": 0.02,
        "epsilon_timesteps": 10000,  # Timesteps over which to anneal epsilon.
    },

    # b) DQN Soft-Q: In order to switch to Soft-Q exploration, do instead:
    "explore": True,
    "exploration_config": {
        "type": "SoftQ",
        # Parameters for the Exploration class' constructor:
        "temperature": 1.0,
    },

    # c) All policy-gradient algos and SAC: see rllib/algorithms/algorithm.py
    # Behavior: The algo samples stochastically from the
    # model-parameterized distribution. This is the global Algorithm default
    # setting defined in algorithm.py and used by all PG-type algos (plus SAC).
    "explore": True,
    "exploration_config": {
        "type": "StochasticSampling",
        "random_timesteps": 0,  # timesteps at beginning, over which to act uniformly randomly
    },
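In the config-object style (which these excerpts will eventually be migrated to), the
same settings go through ``AlgorithmConfig.exploration()``. A minimal sketch for DQN
with epsilon-greedy exploration (values illustrative):

.. code-block:: python

    from ray.rllib.algorithms.dqn import DQNConfig

    config = (
        DQNConfig()
        .environment("CartPole-v1")
        .exploration(
            explore=True,
            exploration_config={
                "type": "EpsilonGreedy",
                "initial_epsilon": 1.0,
                "final_epsilon": 0.02,
                "epsilon_timesteps": 10000,
            },
        )
    )
    algo = config.build()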
.. _CustomEvaluation:

Customized Evaluation During Training
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

RLlib will report online training rewards; however, in some cases you may want to compute
rewards with different settings (e.g., with exploration turned off, or on a specific set
of environment configurations). You can activate evaluating policies during training
(``Algorithm.train()``) by setting the ``evaluation_interval`` to an int value (> 0)
indicating every how many ``Algorithm.train()`` calls an "evaluation step" is run:

.. TODO move to doc_code and make it use algo configs.
.. code-block:: python

    # Run one evaluation step on every 3rd `Algorithm.train()` call.
    {
        "evaluation_interval": 3,
    }
An evaluation step runs - using its own ``RolloutWorker`` instances - for
``evaluation_duration`` episodes or time-steps, depending on the
``evaluation_duration_unit`` setting, which can take values of either ``"episodes"``
(default) or ``"timesteps"``.

.. TODO move to doc_code and make it use algo configs.
.. code-block:: python

    # Every time we run an evaluation step, run it for exactly 10 episodes.
    {
        "evaluation_duration": 10,
        "evaluation_duration_unit": "episodes",
    }

    # Every time we run an evaluation step, run it for (close to) 200 timesteps.
    {
        "evaluation_duration": 200,
        "evaluation_duration_unit": "timesteps",
    }
Note: When using ``evaluation_duration_unit=timesteps`` and your ``evaluation_duration``
setting is not divisible by the number of evaluation workers (configurable via
``evaluation_num_workers``), RLlib will round up the number of time-steps specified to
the nearest whole number of time-steps that is divisible by the number of evaluation
workers.
Also, when using ``evaluation_duration_unit=episodes`` and your
``evaluation_duration`` setting is not divisible by the number of evaluation workers
(configurable via ``evaluation_num_workers``), RLlib will run the remainder of episodes
on the first n eval RolloutWorkers and leave the remaining workers idle for that time.
For example:

.. TODO move to doc_code and make it use algo configs.
.. code-block:: python

    # Every time we run an evaluation step, run it for exactly 10 episodes,
    # no matter how many eval workers we have.
    {
        "evaluation_duration": 10,
        "evaluation_duration_unit": "episodes",

        # What if the number of episodes (10) is not divisible by the
        # number of eval workers (7)?
        # -> Run 7 episodes (1 per eval worker), then run 3 more episodes only using
        #    evaluation workers 1-3 (evaluation workers 4-7 remain idle during that time).
        "evaluation_num_workers": 7,
    }
Before each evaluation step, weights from the main model are synchronized
to all evaluation workers.

By default, the evaluation step (if there is one in the current iteration) is run
right **after** the respective training step.
For example, for ``evaluation_interval=1``, the sequence of events is:
``train(0->1), eval(1), train(1->2), eval(2), train(2->3), ...``.
Here, the indices show the version of neural network weights used.
``train(0->1)`` is an update step that changes the weights from version 0 to
version 1 and ``eval(1)`` then uses weights version 1.
Weights index 0 represents the randomly initialized weights of our neural network(s).

Another example: For ``evaluation_interval=2``, the sequence is:
``train(0->1), train(1->2), eval(2), train(2->3), train(3->4), eval(4), ...``.

Instead of running ``train``- and ``eval``-steps in sequence, it is also possible to
run them in parallel via the ``evaluation_parallel_to_training=True`` config setting.
In this case, both training- and evaluation steps are run at the same time via
multi-threading.
This can speed up the evaluation process significantly, but leads to a 1-iteration
delay between reported training- and evaluation results.
The evaluation results are behind in this case because they use slightly outdated
model weights (synchronized after the previous training step).
For example, for ``evaluation_parallel_to_training=True`` and ``evaluation_interval=1``,
the sequence is now:
``train(0->1) + eval(0), train(1->2) + eval(1), train(2->3) + eval(2)``,
where ``+`` means: "at the same time".
Note the change in the weights indices with respect to the non-parallel examples above:
The evaluation weights indices are now "one behind"
the resulting train weights indices (``train(1->**2**) + eval(**1**)``).

When running with the ``evaluation_parallel_to_training=True`` setting, a special "auto" value
is supported for ``evaluation_duration``. This can be used to make the evaluation step take
roughly as long as the concurrently ongoing training step:
.. TODO move to doc_code and make it use algo configs.
.. code-block:: python

    # Run evaluation and training at the same time via threading and make sure they roughly
    # take the same time, such that the next `Algorithm.train()` call can execute
    # immediately and not have to wait for a still ongoing (e.g. b/c of very long episodes)
    # evaluation step:
    {
        "evaluation_interval": 1,
        "evaluation_parallel_to_training": True,
        "evaluation_duration": "auto",  # automatically end evaluation when train step has finished
        "evaluation_duration_unit": "timesteps",  # <- more fine grained than "episodes"
    }
The ``evaluation_config`` key allows you to override any config settings for
the evaluation workers. For example, to switch off exploration in the evaluation steps,
do:

.. TODO move to doc_code and make it use algo configs.
.. code-block:: python

    # Switching off exploration behavior for evaluation workers
    # (see rllib/algorithms/algorithm.py). Use any keys in this sub-dict that are
    # also supported in the main Algorithm config.
    "evaluation_config": {
        "explore": False,
    },
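The same evaluation setup can also be expressed with the config-object API via
``AlgorithmConfig.evaluation()``. A minimal sketch with illustrative values:

.. code-block:: python

    from ray.rllib.algorithms.ppo import PPOConfig

    config = (
        PPOConfig()
        .environment("CartPole-v1")
        .evaluation(
            evaluation_interval=1,
            evaluation_num_workers=2,
            evaluation_duration=10,
            evaluation_duration_unit="episodes",
            # Settings overridden only on the evaluation workers:
            evaluation_config={"explore": False},
        )
    )
    algo = config.build()
    results = algo.train()
    # Evaluation results are reported under the "evaluation" key.
    print(results["evaluation"])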
.. note::

    Policy gradient algorithms are able to find the optimal
    policy, even if this is a stochastic one. Setting ``"explore": False`` above
    will result in the evaluation workers not using this stochastic policy.

The level of parallelism within the evaluation step is determined via the
``evaluation_num_workers`` setting. Set this to larger values if you want the desired
evaluation episodes or time-steps to run as much in parallel as possible.
For example, if your ``evaluation_duration=10``, ``evaluation_duration_unit=episodes``,
and ``evaluation_num_workers=10``, each evaluation ``RolloutWorker``
only has to run one episode in each evaluation step.
In case you observe occasional failures in your (evaluation) RolloutWorkers during
evaluation (e.g. you have an environment that sometimes crashes),
you can use an (experimental) new setting: ``enable_async_evaluation=True``.
This will run the parallel sampling of all evaluation RolloutWorkers via a fault
tolerant, asynchronous manager, such that if one of the workers takes too long to run
through an episode and return data or fails entirely, the other evaluation
RolloutWorkers will pick up its task and complete the job.

Note that with or without async evaluation, all
:ref:`fault tolerance settings <rllib-scaling-guide>`, such as
``ignore_worker_failures`` or ``recreate_failed_workers`` will be respected and applied
to the failed evaluation workers.

Here's an example:

.. TODO move to doc_code and make it use algo configs.
.. code-block:: python

    # Having an environment that occasionally blocks completely for e.g. 10min would
    # also affect (and block) training:
    {
        "evaluation_interval": 1,
        "evaluation_parallel_to_training": True,
        "evaluation_num_workers": 5,  # each worker runs two episodes
        "evaluation_duration": 10,
        "evaluation_duration_unit": "episodes",
    }

**Problem with the above example:**
In case the environment used by worker 3 blocks for 10min, the entire training
and evaluation pipeline will come to a (10min) halt because of this.
The next ``train`` step cannot start before all evaluation has been finished.

**Solution:**
Switch on asynchronous evaluation, meaning we don't wait for individual
evaluation RolloutWorkers to complete their ``n`` episode(s) (or ``n`` time-steps).
Instead, any evaluation RolloutWorker can cover the load of another one that failed
or is stuck in a very long-lasting environment step.

.. TODO move to doc_code and make it use algo configs.
.. code-block:: python

    {
        # ...
        # same settings as above, plus:
        "enable_async_evaluation": True,  # evaluate asynchronously
    }
In case you would like to entirely customize the evaluation step,
set ``custom_eval_function`` in your config to a callable, which takes the Algorithm
object and a WorkerSet object (the Algorithm's ``self.evaluation_workers``
WorkerSet instance) and returns a metrics dictionary.
See `algorithm.py <https://github.com/ray-project/ray/blob/master/rllib/algorithms/algorithm.py>`__
for further documentation.
There is also an end-to-end example of how to set up a custom online evaluation in
`custom_eval.py <https://github.com/ray-project/ray/blob/master/rllib/examples/custom_eval.py>`__.

Note that if you only want to evaluate your policy at the end of training,
you can set ``evaluation_interval: [int]``, where ``[int]`` should be the number
of training iterations before stopping.

Below are some examples of how the custom evaluation metrics are reported nested under
the ``evaluation`` key of normal training results:

.. TODO make sure these outputs are still valid.
.. code-block:: bash

    ------------------------------------------------------------------------
    Sample output for `python custom_eval.py`
    ------------------------------------------------------------------------

    INFO algorithm.py:623 -- Evaluating current policy for 10 episodes.
    INFO algorithm.py:650 -- Running round 0 of parallel evaluation (2/10 episodes)
    INFO algorithm.py:650 -- Running round 1 of parallel evaluation (4/10 episodes)
    INFO algorithm.py:650 -- Running round 2 of parallel evaluation (6/10 episodes)
    INFO algorithm.py:650 -- Running round 3 of parallel evaluation (8/10 episodes)
    INFO algorithm.py:650 -- Running round 4 of parallel evaluation (10/10 episodes)

    Result for PG_SimpleCorridor_2c6b27dc:
      ...
      evaluation:
        custom_metrics: {}
        episode_len_mean: 15.864661654135338
        episode_reward_max: 1.0
        episode_reward_mean: 0.49624060150375937
        episode_reward_min: 0.0
        episodes_this_iter: 133
.. code-block:: bash

    ------------------------------------------------------------------------
    Sample output for `python custom_eval.py --custom-eval`
    ------------------------------------------------------------------------

    INFO algorithm.py:631 -- Running custom eval function <function ...>
    Update corridor length to 4
    Update corridor length to 7
    Custom evaluation round 1
    Custom evaluation round 2
    Custom evaluation round 3
    Custom evaluation round 4

    Result for PG_SimpleCorridor_0de4e686:
      ...
      evaluation:
        custom_metrics: {}
        episode_len_mean: 9.15695067264574
        episode_reward_max: 1.0
        episode_reward_mean: 0.9596412556053812
        episode_reward_min: 0.0
        episodes_this_iter: 223
        foo: 1
Rewriting Trajectories
~~~~~~~~~~~~~~~~~~~~~~

Note that in the ``on_postprocess_traj`` callback you have full access to the
trajectory batch (``post_batch``) and other training state. This can be used to
rewrite the trajectory, which has a number of uses including:

* Backdating rewards to previous time steps (e.g., based on values in ``info``).
* Adding model-based curiosity bonuses to rewards (you can train the model with a
  `custom model supervised loss <rllib-models.html#supervised-model-losses>`__).

To access the policy / model (``policy.model``) in the callbacks, note that
``info['pre_batch']`` returns a tuple where the first element is a policy and the
second one is the batch itself. You can also access all the rollout worker state
using the following call:

.. TODO move to doc_code and make it use algo configs.
.. code-block:: python

    from ray.rllib.evaluation.rollout_worker import get_global_worker

    # You can use this from any callback to get a reference to the
    # RolloutWorker running in the process, which in turn has references to
    # all the policies, etc: see rollout_worker.py for more info.
    rollout_worker = get_global_worker()

Policy losses are defined over the ``post_batch`` data, so you can mutate that in
the callbacks to change what data the policy loss function sees.
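With the ``DefaultCallbacks`` class, the corresponding hook is
``on_postprocess_trajectory``, which receives the postprocessed batch directly.
A minimal reward-rewriting sketch (the class name and the constant bonus are purely
illustrative):

.. code-block:: python

    from ray.rllib.algorithms.callbacks import DefaultCallbacks
    from ray.rllib.policy.sample_batch import SampleBatch

    class RewardShapingCallbacks(DefaultCallbacks):
        def on_postprocess_trajectory(
            self, *, worker, episode, agent_id, policy_id, policies,
            postprocessed_batch, original_batches, **kwargs
        ):
            # `postprocessed_batch` is the trajectory the policy loss will later
            # see - mutating it here rewrites the trajectory.
            # Example: add a small constant bonus to every reward.
            postprocessed_batch[SampleBatch.REWARDS] += 0.01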