
.. include:: /_includes/rllib/announcement.rst

.. include:: /_includes/rllib/we_are_hiring.rst

Working With Offline Data
=========================

Getting started
---------------

RLlib's offline dataset APIs enable working with experiences read from offline storage (e.g., disk, cloud storage, streaming systems, HDFS). For example, you might want to read experiences saved from previous training runs, or gathered from policies deployed in `web applications <https://arxiv.org/abs/1811.00260>`__. You can also log new agent experiences produced during online training for future use.

RLlib represents trajectory sequences (i.e., ``(s, a, r, s', ...)`` tuples) with `SampleBatch <https://github.com/ray-project/ray/blob/master/rllib/policy/sample_batch.py>`__ objects. Using a batch format enables efficient encoding and compression of experiences. During online training, RLlib uses `policy evaluation <rllib-concepts.html#policy-evaluation>`__ actors to generate batches of experiences in parallel using the current policy. RLlib also uses this same batch format for reading and writing experiences to offline storage.
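
For illustration, here is a minimal sketch of building a small ``SampleBatch`` by hand. The column names follow the conventions RLlib uses for CartPole-style data; the values themselves are made up:

.. code-block:: python

    from ray.rllib.policy.sample_batch import SampleBatch

    # A two-timestep trajectory fragment, stored column-wise (values are made up).
    batch = SampleBatch({
        "obs": [[0.1, 0.2, 0.3, 0.4], [0.2, 0.2, 0.3, 0.4]],
        "actions": [0, 1],
        "rewards": [1.0, 1.0],
        "dones": [False, True],
    })

    print(batch.count)       # -> 2 (number of timesteps in the batch)
    print(batch["rewards"])  # -> the rewards column
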
Example: Training on previously saved experiences
--------------------------------------------------

.. note::

    For custom models and environments, you'll need to use the `Python API <rllib-training.html#basic-python-api>`__.

In this example, we will save batches of experiences generated during online training to disk, and then leverage this saved data to train a policy offline using DQN. First, we run a simple policy gradient algorithm for 100k steps with ``"output": "/tmp/cartpole-out"`` to tell RLlib to write simulation outputs to the ``/tmp/cartpole-out`` directory.

.. code-block:: bash

    $ rllib train \
        --run=PG \
        --env=CartPole-v1 \
        --config='{"output": "/tmp/cartpole-out", "output_max_file_size": 5000000}' \
        --stop='{"timesteps_total": 100000}'

The experiences will be saved in compressed JSON batch format:

.. code-block:: text

    $ ls -l /tmp/cartpole-out
    total 11636
    -rw-rw-r-- 1 eric eric 5022257 output-2019-01-01_15-58-57_worker-0_0.json
    -rw-rw-r-- 1 eric eric 5002416 output-2019-01-01_15-59-22_worker-0_1.json
    -rw-rw-r-- 1 eric eric 1881666 output-2019-01-01_15-59-47_worker-0_2.json

Then, we can tell DQN to train using these previously generated experiences with ``"input": "/tmp/cartpole-out"``. We disable exploration since it has no effect on the input:

.. code-block:: bash

    $ rllib train \
        --run=DQN \
        --env=CartPole-v1 \
        --config='{
            "input": "/tmp/cartpole-out",
            "explore": false}'

Off-Policy Estimation (OPE)
---------------------------

In practice, when training on offline data, it is usually not straightforward to evaluate the trained policies using a simulator as in online RL. For example, in recommender systems, rolling out a policy trained on offline data in a real-world environment can jeopardize your business if the policy is suboptimal. For these situations we can use `off-policy estimation <https://arxiv.org/abs/1911.06854>`__ methods, which avoid the risk of evaluating a possibly sub-optimal policy in a real-world environment.

With RLlib's evaluation framework you can:

- Evaluate policies on a simulated environment, if available, using ``evaluation_config["input"] = "sampler"``. You can then monitor your policy's performance on TensorBoard as it is getting trained (by using ``tensorboard --logdir=~/ray_results``).
- Use RLlib's off-policy estimation methods, which estimate the policy's performance on a separate offline dataset. To use this feature, the evaluation dataset must contain an ``action_prob`` key representing the action probability distribution of the collected data, so that counterfactual evaluation is possible (a quick check is sketched after this list).
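
For example, a quick (hypothetical) sanity check that a logged dataset contains the required ``action_prob`` column, using the ``JsonReader`` shown later in this document:

.. code-block:: python

    from ray.rllib.offline.json_reader import JsonReader

    # Read one batch from the offline dataset and verify that the behavior
    # policy's action probabilities were logged.
    reader = JsonReader("/tmp/cartpole-out")
    batch = reader.next()
    assert "action_prob" in batch, "OPE requires the behavior policy's action_prob"
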
RLlib supports the following off-policy estimators:

- `Importance Sampling (IS) <https://github.com/ray-project/ray/blob/master/rllib/offline/estimators/importance_sampling.py>`__
- `Weighted Importance Sampling (WIS) <https://github.com/ray-project/ray/blob/master/rllib/offline/estimators/weighted_importance_sampling.py>`__
- `Direct Method (DM) <https://github.com/ray-project/ray/blob/master/rllib/offline/estimators/direct_method.py>`__
- `Doubly Robust (DR) <https://github.com/ray-project/ray/blob/master/rllib/offline/estimators/doubly_robust.py>`__

IS and WIS compute the ratio between the action probabilities under the behavior policy (from the dataset) and the target policy (the policy under evaluation), and use this ratio to estimate the policy's return. More details can be found in their respective papers.

DM and DR train a Q-model to compute the estimated return. By default, RLlib uses `Fitted-Q Evaluation (FQE) <https://arxiv.org/abs/1911.06854>`__ to train the Q-model. See `fqe_torch_model.py <https://github.com/ray-project/ray/blob/master/rllib/offline/estimators/fqe_torch_model.py>`__ for more details.

.. note:: For a contextual bandit dataset, the ``dones`` key should always be set to ``True``. In this case, FQE reduces to fitting a reward model to the data.

RLlib's OPE estimators output six metrics:

- ``v_behavior``: The discounted sum of rewards in the offline episode, averaged over episodes in the batch.
- ``v_behavior_std``: The standard deviation corresponding to ``v_behavior``.
- ``v_target``: The OPE's estimated discounted return for the target policy, averaged over episodes in the batch.
- ``v_target_std``: The standard deviation corresponding to ``v_target``.
- ``v_gain``: ``v_target / max(v_behavior, 1e-8)``. ``v_gain > 1.0`` indicates that the target policy is better than the policy that generated the behavior data. If ``v_behavior <= 0``, use ``v_delta`` instead for comparison (see the small numeric example after this list).
- ``v_delta``: The difference between ``v_target`` and ``v_behavior``.
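
As a small numeric illustration of the last two metrics (all values made up):

.. code-block:: python

    v_behavior, v_target = 85.0, 97.5

    v_gain = v_target / max(v_behavior, 1e-8)  # ~1.15 -> target policy looks better
    v_delta = v_target - v_behavior            # 12.5
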
As an example, we generate an evaluation dataset for off-policy estimation:

.. code-block:: bash

    $ rllib train \
        --run=PG \
        --env=CartPole-v1 \
        --config='{"output": "/tmp/cartpole-eval", "output_max_file_size": 5000000}' \
        --stop='{"timesteps_total": 10000}'

.. hint:: You should use separate datasets for algorithm training and OPE, as shown here.

We can now train a DQN algorithm offline and evaluate it using OPE:
.. code-block:: python

    from ray.rllib.algorithms.dqn import DQNConfig
    from ray.rllib.offline.estimators import (
        ImportanceSampling,
        WeightedImportanceSampling,
        DirectMethod,
        DoublyRobust,
    )
    from ray.rllib.offline.estimators.fqe_torch_model import FQETorchModel

    config = (
        DQNConfig()
        .environment(env="CartPole-v1")
        .framework("torch")
        .offline_data(input_="/tmp/cartpole-out")
        .evaluation(
            evaluation_interval=1,
            evaluation_duration=10,
            evaluation_num_workers=1,
            evaluation_duration_unit="episodes",
            evaluation_config={"input": "/tmp/cartpole-eval"},
            off_policy_estimation_methods={
                "is": {"type": ImportanceSampling},
                "wis": {"type": WeightedImportanceSampling},
                "dm_fqe": {
                    "type": DirectMethod,
                    "q_model_config": {"type": FQETorchModel, "polyak_coef": 0.05},
                },
                "dr_fqe": {
                    "type": DoublyRobust,
                    "q_model_config": {"type": FQETorchModel, "polyak_coef": 0.05},
                },
            },
        )
    )

    algo = config.build()
    for _ in range(100):
        algo.train()

.. image:: images/rllib-offline.png

**Estimator Python API:** For greater control over the evaluation process, you can create off-policy estimators in your Python code and call ``estimator.train(batch)`` to perform any necessary training and ``estimator.estimate(batch)`` to perform counterfactual estimation. The estimators take in an RLlib Policy object and a gamma value for the environment, along with additional estimator-specific arguments (e.g. ``q_model_config`` for DM and DR). You can take a look at the example config parameters of the ``q_model_config`` `here <https://github.com/ray-project/ray/blob/master/rllib/offline/estimators/fqe_torch_model.py>`__. You can also write your own off-policy estimator by subclassing from the `OffPolicyEstimator <https://github.com/ray-project/ray/blob/master/rllib/offline/estimators/off_policy_estimator.py>`__ base class, as sketched after the example below.

.. code-block:: python

    algo = DQN(...)
    ...  # train policy offline

    from ray.rllib.offline.json_reader import JsonReader
    from ray.rllib.offline.estimators import DoublyRobust
    from ray.rllib.offline.estimators.fqe_torch_model import FQETorchModel

    estimator = DoublyRobust(
        policy=algo.get_policy(),
        gamma=0.99,
        q_model_config={"type": FQETorchModel, "n_iters": 160},
    )

    # Train estimator's Q-model; only required for DM and DR estimators.
    reader = JsonReader("/tmp/cartpole-out")
    for _ in range(100):
        batch = reader.next()
        print(estimator.train(batch))
        # {'loss': ...}

    reader = JsonReader("/tmp/cartpole-eval")
    # Compute off-policy estimates.
    for _ in range(100):
        batch = reader.next()
        print(estimator.estimate(batch))
        # {'v_behavior': ..., 'v_target': ..., 'v_gain': ...,
        #  'v_behavior_std': ..., 'v_target_std': ..., 'v_delta': ...}
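
As a starting point for a custom estimator, here is a minimal sketch. It assumes the base class takes the same ``policy`` and ``gamma`` arguments as the ``DoublyRobust`` example above and that overriding ``train()`` and ``estimate()`` is sufficient; check the ``OffPolicyEstimator`` source for the exact interface of your RLlib version:

.. code-block:: python

    from ray.rllib.offline.estimators.off_policy_estimator import OffPolicyEstimator
    from ray.rllib.policy.sample_batch import SampleBatch


    class BehaviorReturnEstimator(OffPolicyEstimator):
        """Toy estimator that reports the average undiscounted behavior return."""

        def train(self, batch: SampleBatch):
            # Nothing to fit for this toy estimator (unlike DM/DR, which train FQE).
            return {}

        def estimate(self, batch: SampleBatch):
            episodes = batch.split_by_episode()
            returns = [ep["rewards"].sum() for ep in episodes]
            return {"v_behavior": float(sum(returns) / len(returns))}

You could then instantiate it like the built-in estimators, e.g. ``BehaviorReturnEstimator(policy=algo.get_policy(), gamma=0.99)``, or register it in ``off_policy_estimation_methods`` under a custom key.
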
Example: Converting external experiences to batch format
----------------------------------------------------------

When the env does not support simulation (e.g., it is a web application), it is necessary to generate the ``*.json`` experience batch files outside of RLlib. This can be done by using the `JsonWriter <https://github.com/ray-project/ray/blob/master/rllib/offline/json_writer.py>`__ class to write out batches.
This `runnable example <https://github.com/ray-project/ray/blob/master/rllib/examples/saving_experiences.py>`__ shows how to generate and save experience batches for CartPole-v1 to disk:

.. literalinclude:: ../../../rllib/examples/saving_experiences.py
    :language: python
    :start-after: __sphinx_doc_begin__
    :end-before: __sphinx_doc_end__

On-policy algorithms and experience postprocessing
----------------------------------------------------

RLlib assumes that input batches are of
`postprocessed experiences <https://github.com/ray-project/ray/blob/master/rllib/policy/policy.py#L434>`__.
This isn't typically critical for off-policy algorithms
(e.g., DQN's `post-processing <https://github.com/ray-project/ray/blob/master/rllib/algorithms/dqn/dqn_tf_policy.py#L434>`__
is only needed if ``n_step > 1`` or ``replay_buffer_config.worker_side_prioritization: True``).
For off-policy algorithms, you can also safely set the ``postprocess_inputs: True`` config to auto-postprocess data.

However, for on-policy algorithms like PPO, you'll need to pass in the extra values added during policy evaluation and postprocessing to ``batch_builder.add_values()``, e.g., ``logits``, ``vf_preds``, ``value_target``, and ``advantages`` for PPO. This is needed because the calculation of these values depends on the parameters of the *behavior* policy, which RLlib does not have access to in the offline setting (in online training, these values are automatically added during policy evaluation). A sketch of writing such a batch row is shown at the end of this section.

Note that for on-policy algorithms, you'll also have to throw away experiences generated by prior versions of the policy. This greatly reduces sample efficiency, which is typically undesirable for offline training, but can make sense for certain applications.
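
Here is a hedged sketch of what such a row might look like when written with ``SampleBatchBuilder`` (as in the saving-experiences example above). All numbers are placeholders, and the exact extra-column names (for example ``action_dist_inputs`` vs. ``logits``) depend on the algorithm and RLlib version:

.. code-block:: python

    from ray.rllib.evaluation.sample_batch_builder import SampleBatchBuilder

    batch_builder = SampleBatchBuilder()

    # One toy CartPole timestep. The last few columns are the PPO-specific values
    # and must come from the *behavior* policy that generated the data.
    batch_builder.add_values(
        obs=[0.1, 0.2, 0.3, 0.4],
        actions=1,
        rewards=1.0,
        dones=False,
        new_obs=[0.2, 0.2, 0.3, 0.4],
        action_prob=0.62,                  # behavior policy's probability of the action
        action_dist_inputs=[0.49, -0.12],  # behavior policy's "logits"
        vf_preds=21.3,                     # behavior policy's value estimate
        advantages=0.7,                    # e.g. computed via GAE from behavior data
        value_targets=22.0,
    )
    batch = batch_builder.build_and_reset()
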
Mixing simulation and offline data
-----------------------------------

RLlib supports multiplexing inputs from multiple input sources, including simulation. In the following example, we read 40% of our experiences from ``/tmp/cartpole-out``, 30% from ``hdfs:/archive/cartpole``, and produce the last 30% via policy evaluation. Input sources are multiplexed using `np.random.choice <https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.choice.html>`__:

.. code-block:: bash

    $ rllib train \
        --run=DQN \
        --env=CartPole-v1 \
        --config='{
            "input": {
                "/tmp/cartpole-out": 0.4,
                "hdfs:/archive/cartpole": 0.3,
                "sampler": 0.3
            },
            "explore": false}'

Scaling I/O throughput
-----------------------

Similar to scaling online training, you can scale offline I/O throughput by increasing the number of RLlib workers via the ``num_workers`` config. Each worker accesses offline storage independently and in parallel, for linear scaling of I/O throughput. Within each read worker, files are chosen in random order for reads, but file contents are read sequentially.
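
For example, a sketch that reads the CartPole data generated earlier with four parallel read workers:

.. code-block:: python

    config = {
        "input": "/tmp/cartpole-out",
        # Each of the four rollout workers reads files independently and in parallel.
        "num_workers": 4,
    }
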
Ray Data Integration
--------------------

RLlib has experimental support for reading/writing training samples from/to large offline datasets using
:ref:`Ray Data <data>`.
We support JSON and Parquet files today. Other file formats supported by Ray Data can also be easily added.

Unlike JSON input, a single dataset can be automatically sharded and replayed by multiple rollout workers
by simply specifying the desired ``num_workers`` config.

To load sample data using Ray Data, specify the ``input`` and ``input_config`` keys like the following:

.. code-block:: python

    config = {
        ...
        "input": "dataset",
        "input_config": {
            "format": "json",  # json or parquet
            # Path to data file or directory.
            "path": "/path/to/json_dir/",
            # Number of tasks reading the dataset in parallel; defaults to num_workers.
            "parallelism": 3,
            # The dataset allocates 0.5 CPU for each reader by default.
            # Adjust this value based on the size of your offline dataset.
            "num_cpus_per_read_task": 0.5,
        }
        ...
    }

To write sample data to JSON or Parquet files using Ray Data, specify the ``output`` and ``output_config`` keys like the following:

.. code-block:: python

    config = {
        "output": "dataset",
        "output_config": {
            "format": "json",  # json or parquet
            # Directory to write data files.
            "path": "/tmp/test_samples/",
            # Break samples into multiple files, each containing about this many records.
            "max_num_samples_per_file": 100000,
        }
    }

Writing Environment Data
------------------------

To include environment data in the training sample datasets, you can use the optional
``store_infos`` parameter that is part of the ``output_config`` dictionary. This parameter
ensures that the ``infos`` dictionary, as returned by the RL environment, is included in the output files.

.. note:: It is the responsibility of the user to ensure that the content of ``infos`` can be serialized to file.

.. note:: This setting is only relevant for TensorFlow-based agents; for PyTorch agents, the ``infos`` data is always stored.

To write the ``infos`` data to JSON or Parquet files using Ray Data, specify the ``output`` and ``output_config`` keys like the following:

.. code-block:: python

    config = {
        "output": "dataset",
        "output_config": {
            "format": "json",  # json or parquet
            # Directory to write data files.
            "path": "/tmp/test_samples/",
            # Write the infos dict data.
            "store_infos": True,
        }
    }

Input Pipeline for Supervised Losses
------------------------------------

You can also define supervised model losses over offline data. This requires defining a `custom model loss <rllib-models.html#supervised-model-losses>`__. We provide a convenience function, ``InputReader.tf_input_ops()``, that can be used to convert any input reader to a TF input pipeline. For example:

.. code-block:: python

    def custom_loss(self, policy_loss):
        input_reader = JsonReader("/tmp/cartpole-out")
        # print(input_reader.next())  # if you want to access imperatively

        input_ops = input_reader.tf_input_ops()
        print(input_ops["obs"])      # -> output Tensor shape=[None, 4]
        print(input_ops["actions"])  # -> output Tensor shape=[None]

        supervised_loss = some_function_of(input_ops)
        return policy_loss + supervised_loss

See `custom_model_loss_and_metrics.py <https://github.com/ray-project/ray/blob/master/rllib/examples/custom_model_loss_and_metrics.py>`__ for a runnable example of using these TF input ops in a custom loss.
Input API
---------

You can configure experience input for an agent using the following options:

.. tip::

    Plain python config dicts will soon be replaced by :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig`
    objects, which have the advantage of being type safe, allowing users to set different config settings within
    meaningful sub-categories (e.g. ``my_config.offline_data(input_=[xyz])``), and offer the ability to
    construct an Algorithm instance from these config objects (via their ``.build()`` method).

.. code-block:: python

    # Specify how to generate experiences:
    #  - "sampler": Generate experiences via online (env) simulation (default).
    #  - A local directory or file glob expression (e.g., "/tmp/*.json").
    #  - A list of individual file paths/URIs (e.g., ["/tmp/1.json",
    #    "s3://bucket/2.json"]).
    #  - A dict with string keys and sampling probabilities as values (e.g.,
    #    {"sampler": 0.4, "/tmp/*.json": 0.4, "s3://bucket/expert.json": 0.2}).
    #  - A callable that takes an `IOContext` object as its only arg and returns a
    #    ray.rllib.offline.InputReader.
    #  - A string key that indexes a callable registered with tune.registry.register_input.
    "input": "sampler",
    # Arguments accessible from the IOContext for configuring custom input.
    "input_config": {},
    # True, if the actions in a given offline "input" are already normalized
    # (between -1.0 and 1.0). This is usually the case when the offline
    # file has been generated by another RLlib algorithm (e.g. PPO or SAC),
    # while "normalize_actions" was set to True.
    "actions_in_input_normalized": False,
    # Specify how to evaluate the current policy. This only has an effect when
    # reading offline experiences ("input" is not "sampler").
    # Available options:
    #  - "simulation": Run the environment in the background, but use
    #    this data for evaluation only and not for learning.
    #  - Any subclass of OffPolicyEstimator, e.g.
    #    ray.rllib.offline.estimators.is::ImportanceSampling or your own custom
    #    subclass.
    "off_policy_estimation_methods": {
        "is": {"type": ImportanceSampling},
        "wis": {"type": WeightedImportanceSampling},
    },
    # Whether to run postprocess_trajectory() on the trajectory fragments from
    # offline inputs. Note that postprocessing will be done using the *current*
    # policy, not the *behavior* policy, which is typically undesirable for
    # on-policy algorithms.
    "postprocess_inputs": False,
    # If positive, input batches will be shuffled via a sliding window buffer
    # of this number of batches. Use this if the input data is not in random
    # enough order. Input is delayed until the shuffle buffer is filled.
    "shuffle_buffer_size": 0,

The interface for a custom input reader is as follows:

.. autoclass:: ray.rllib.offline.InputReader
    :members:
    :noindex:

Example Custom Input API
------------------------

You can create a custom input reader like the following:

.. code-block:: python

    from ray.rllib.offline import InputReader, IOContext, ShuffledInput
    from ray.tune.registry import register_input

    class CustomInputReader(InputReader):
        def __init__(self, ioctx: IOContext): ...
        def next(self): ...

    def input_creator(ioctx: IOContext) -> InputReader:
        return ShuffledInput(CustomInputReader(ioctx))

    register_input("custom_input", input_creator)

    config = {
        "input": "custom_input",
        "input_config": {},
        ...
    }

You can pass arguments from the config to the custom input API through the
``input_config`` option, which can be accessed via the ``IOContext``, as sketched below.
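
For instance, here is a sketch of a reader that pulls a made-up ``"path"`` option out of ``input_config``; it assumes the ``IOContext`` exposes this dict via its ``input_config`` attribute (see the interface below):

.. code-block:: python

    from ray.rllib.offline import InputReader, IOContext
    from ray.rllib.offline.json_reader import JsonReader


    class PathConfigurableReader(InputReader):
        def __init__(self, ioctx: IOContext):
            # "path" is a made-up key, supplied via "input_config" in the config.
            self.reader = JsonReader(ioctx.input_config.get("path", "/tmp/cartpole-out"))

        def next(self):
            return self.reader.next()

You would then configure it with, e.g., ``"input_config": {"path": "/tmp/cartpole-eval"}``.
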
The interface for the ``IOContext`` is the following:

.. autoclass:: ray.rllib.offline.IOContext
    :members:
    :noindex:

See `custom_input_api.py <https://github.com/ray-project/ray/blob/master/rllib/examples/custom_input_api.py>`__ for a runnable example.
Output API
----------

You can configure experience output for an agent using the following options:

.. tip::

    Plain python config dicts will soon be replaced by :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig`
    objects, which have the advantage of being type safe, allowing users to set different config settings within
    meaningful sub-categories (e.g. ``my_config.offline_data(input_=[xyz])``), and offer the ability to
    construct an Algorithm instance from these config objects (via their ``.build()`` method).

.. code-block:: python

    # Specify where experiences should be saved:
    #  - None: don't save any experiences
    #  - "logdir" to save to the agent log dir
    #  - a path/URI to save to a custom output directory (e.g., "s3://bucket/")
    #  - a function that returns a rllib.offline.OutputWriter
    "output": None,
    # Arguments accessible from the IOContext for configuring custom output.
    "output_config": {},
    # What sample batch columns to LZ4 compress in the output data.
    "output_compress_columns": ["obs", "new_obs"],
    # Max output file size (in bytes) before rolling over to a new file.
    "output_max_file_size": 64 * 1024 * 1024,

The interface for a custom output writer is as follows:

.. autoclass:: ray.rllib.offline.OutputWriter
    :members:
    :noindex:
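
Analogous to the custom input reader, a custom output writer can be plugged in via a callable (the config above notes that ``"output"`` also accepts a function returning an ``OutputWriter``). A minimal sketch, assuming the callable receives an ``IOContext`` like the input creator does:

.. code-block:: python

    from ray.rllib.offline import IOContext, OutputWriter
    from ray.rllib.policy.sample_batch import SampleBatch


    class PrintingWriter(OutputWriter):
        """Toy writer that only reports how many timesteps each batch contains."""

        def write(self, sample_batch: SampleBatch):
            print(f"Received a batch with {sample_batch.count} timesteps.")


    def output_creator(ioctx: IOContext) -> OutputWriter:
        return PrintingWriter()


    config = {
        "output": output_creator,
    }
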
.. include:: /_includes/rllib/announcement_bottom.rst