=============================
User Guide & Configuring Tune
=============================

These pages will demonstrate the various features and configurations of Tune.

.. tip:: Before you continue, be sure to have read :ref:`tune-60-seconds`.

This document provides an overview of the core concepts as well as some of the configurations for running Tune.

.. _tune-parallelism:

Resources (Parallelism, GPUs, Distributed)
------------------------------------------

.. tip:: To run everything sequentially, use :ref:`Ray Local Mode <tune-debugging>`.

Parallelism is determined by ``resources_per_trial`` (defaulting to 1 CPU, 0 GPU per trial) and the resources available to Tune (``ray.cluster_resources()``).

By default, Tune automatically runs N concurrent trials, where N is the number of CPUs (cores) on your machine.

.. code-block:: python

    # If you have 4 CPUs on your machine, this will run 4 concurrent trials at a time.
    tune.run(trainable, num_samples=10)

You can override this parallelism with ``resources_per_trial``. Here you can
specify your resource requests using either a dictionary or a
:class:`PlacementGroupFactory <ray.tune.utils.placement_groups.PlacementGroupFactory>`
object. In either case, Ray Tune will try to start a placement group for each trial.

.. code-block:: python

    # If you have 4 CPUs on your machine, this will run 2 concurrent trials at a time.
    tune.run(trainable, num_samples=10, resources_per_trial={"cpu": 2})

    # If you have 4 CPUs on your machine, this will run 1 trial at a time.
    tune.run(trainable, num_samples=10, resources_per_trial={"cpu": 4})

    # Fractional values are also supported, (i.e., {"cpu": 0.5}).
    tune.run(trainable, num_samples=10, resources_per_trial={"cpu": 0.5})

Tune will allocate the specified GPUs and CPUs from ``resources_per_trial`` to each individual trial.

Even if the trial cannot be scheduled right now, Ray Tune will still try to start
the respective placement group. If not enough resources are available, this will trigger
:ref:`autoscaling behavior <cluster-index>` if you're using the Ray cluster launcher.

If your trainable function starts more remote workers, you will need to pass
placement group factory objects to request these resources, as shown in the sketch below. See the
:class:`PlacementGroupFactory documentation <ray.tune.utils.placement_groups.PlacementGroupFactory>`
for further information.

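As a hedged illustration, a trainable that launches two additional remote workers with one CPU each could request its resources with a placement group factory like this (``train_with_workers`` is a hypothetical trainable; the first bundle is reserved for the trainable itself):

.. code-block:: python

    from ray import tune

    pgf = tune.PlacementGroupFactory([
        {"CPU": 1},  # bundle for the trainable process itself
        {"CPU": 1},  # bundle for hypothetical remote worker 1
        {"CPU": 1},  # bundle for hypothetical remote worker 2
    ])

    tune.run(train_with_workers, num_samples=4, resources_per_trial=pgf)
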
Using GPUs
~~~~~~~~~~

To leverage GPUs, you must set ``gpu`` in ``tune.run(resources_per_trial)``. This will automatically set ``CUDA_VISIBLE_DEVICES`` for each trial.

.. code-block:: python

    # If you have 8 GPUs, this will run 8 trials at once.
    tune.run(trainable, num_samples=10, resources_per_trial={"gpu": 1})

    # If you have 4 CPUs and 1 GPU on your machine, this will run 1 trial at a time.
    tune.run(trainable, num_samples=10, resources_per_trial={"cpu": 2, "gpu": 1})

You can find an example of this in the :doc:`Keras MNIST example </tune/examples/tune_mnist_keras>`.

.. warning:: If ``gpu`` is not set, the ``CUDA_VISIBLE_DEVICES`` environment variable will be set to empty, disallowing GPU access.

**Troubleshooting**: Occasionally, you may run into GPU memory issues when running a new trial. This may be
due to the previous trial not cleaning up its GPU state fast enough. To avoid this,
you can use ``tune.utils.wait_for_gpu`` - see the :ref:`docstring <tune-util-ref>` and the sketch below.

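A minimal sketch of how this might look at the top of a GPU trainable, assuming ``tune.utils.wait_for_gpu`` is called with its default arguments (``gpu_trainable`` and the reported metric are placeholders):

.. code-block:: python

    from ray import tune
    from ray.tune.utils import wait_for_gpu

    def gpu_trainable(config):
        # Block until the assigned GPU has freed most of its memory
        # from a previous trial before allocating new tensors on it.
        wait_for_gpu()
        # ... build the model and train as usual ...
        tune.report(mean_accuracy=0.0)  # placeholder metric

    tune.run(gpu_trainable, resources_per_trial={"gpu": 1})
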
Concurrent samples
~~~~~~~~~~~~~~~~~~

If using a :ref:`search algorithm <tune-search-alg>`, you may want to limit the number of trials that are being evaluated. For example, you may want to serialize the evaluation of trials to do sequential optimization.

In this case, use ``ray.tune.suggest.ConcurrencyLimiter`` to limit the amount of concurrency:

.. code-block:: python

    from ray.tune.schedulers import AsyncHyperBandScheduler
    from ray.tune.suggest import ConcurrencyLimiter
    from ray.tune.suggest.bayesopt import BayesOptSearch

    algo = BayesOptSearch(utility_kwargs={
        "kind": "ucb",
        "kappa": 2.5,
        "xi": 0.0
    })
    algo = ConcurrencyLimiter(algo, max_concurrent=4)
    scheduler = AsyncHyperBandScheduler()

See :ref:`limiter` for more details.

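For completeness, here is a hedged sketch of wiring these objects into ``tune.run`` (``my_trainable``, the metric name, and ``num_samples`` are placeholders):

.. code-block:: python

    tune.run(
        my_trainable,
        metric="mean_accuracy",  # placeholder metric name
        mode="max",
        search_alg=algo,         # the ConcurrencyLimiter-wrapped search algorithm
        scheduler=scheduler,
        num_samples=20,
    )
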
Distributed Tuning
~~~~~~~~~~~~~~~~~~

.. tip:: This section covers how to run Tune across multiple machines. See :ref:`Distributed Training <tune-dist-training>` for guidance on tuning distributed training jobs.

To attach to a Ray cluster, simply run ``ray.init`` before ``tune.run``. See :ref:`start-ray-cli` for more information about ``ray.init``:

.. code-block:: python

    # Connect to an existing distributed Ray cluster
    ray.init(address=<ray_address>)
    tune.run(trainable, num_samples=100, resources_per_trial=tune.PlacementGroupFactory([{"CPU": 2, "GPU": 1}]))

Read more in the Tune :ref:`distributed experiments guide <tune-distributed>`.

.. _tune-dist-training:

Tune Distributed Training
~~~~~~~~~~~~~~~~~~~~~~~~~

To tune distributed training jobs, Tune provides a set of ``DistributedTrainableCreator`` wrappers for different training frameworks.
Below is an example for tuning distributed TensorFlow jobs:

.. code-block:: python

    # Please refer to the full example in tf_distributed_keras_example.py
    from ray.tune.integration.tensorflow import DistributedTrainableCreator

    tf_trainable = DistributedTrainableCreator(
        train_mnist,
        use_gpu=args.use_gpu,
        num_workers=2)
    tune.run(tf_trainable,
             num_samples=1)

Read more about tuning :ref:`distributed PyTorch <tune-ddp-doc>`, :ref:`TensorFlow <tune-dist-tf-doc>`, and :ref:`Horovod <tune-integration-horovod>` jobs.

.. _tune-default-search-space:

Search Space (Grid/Random)
--------------------------

You can specify a grid search or sampling distribution via the dict passed into ``tune.run(config=)``.

.. code-block:: python

    parameters = {
        "qux": tune.sample_from(lambda spec: 2 + 2),
        "bar": tune.grid_search([True, False]),
        "foo": tune.grid_search([1, 2, 3]),
        "baz": "asd",  # a constant value
    }

    tune.run(trainable, config=parameters)

By default, each random variable and grid search point is sampled once. To take multiple random samples, add ``num_samples: N`` to the experiment config. If ``grid_search`` is provided as an argument, the grid will be repeated ``num_samples`` times.

.. code-block:: python
    :emphasize-lines: 13

    # num_samples=10 repeats the 3x3 grid search 10 times, for a total of 90 trials
    tune.run(
        my_trainable,
        name="my_trainable",
        config={
            "alpha": tune.uniform(0, 100),
            "beta": tune.sample_from(lambda spec: spec.config.alpha * np.random.normal()),
            "nn_layers": [
                tune.grid_search([16, 64, 256]),
                tune.grid_search([16, 64, 256]),
            ],
        },
        num_samples=10
    )

Note that search spaces may not be interoperable across different search algorithms. For example, for many search algorithms, you will not be able to use a ``grid_search`` parameter. Read about this in the :ref:`Search Space API <tune-search-space>` page.

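As a hedged illustration, when using such a search algorithm you would typically express the discrete choices with ``tune.choice`` instead of ``grid_search`` (the parameter names below are placeholders):

.. code-block:: python

    from ray import tune

    # A search space compatible with most search algorithms:
    # only sampling distributions, no grid_search.
    config = {
        "alpha": tune.uniform(0, 100),
        "layer_1": tune.choice([16, 64, 256]),
        "layer_2": tune.choice([16, 64, 256]),
    }

    tune.run(my_trainable, config=config, num_samples=30)
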
.. _tune-autofilled-metrics:

Auto-filled Metrics
-------------------

You can log arbitrary values and metrics in both training APIs:

.. code-block:: python

    def trainable(config):
        for i in range(num_epochs):
            ...
            tune.report(acc=accuracy, metric_foo=random_metric_1, bar=metric_2)

    class Trainable(tune.Trainable):
        def step(self):
            ...
            # don't call report here!
            return dict(acc=accuracy, metric_foo=random_metric_1, bar=metric_2)

During training, Tune will automatically log the below metrics in addition to the user-provided values. All of these can be used as stopping conditions or passed as a parameter to Trial Schedulers/Search Algorithms.

* ``config``: The hyperparameter configuration
* ``date``: String-formatted date and time when the result was processed
* ``done``: True if the trial has been finished, False otherwise
* ``episodes_total``: Total number of episodes (for RLlib trainables)
* ``experiment_id``: Unique experiment ID
* ``experiment_tag``: Unique experiment tag (includes parameter values)
* ``hostname``: Hostname of the worker
* ``iterations_since_restore``: The number of times ``tune.report()``/``trainable.train()`` has been
  called after restoring the worker from a checkpoint
* ``node_ip``: Host IP of the worker
* ``pid``: Process ID (PID) of the worker process
* ``time_since_restore``: Time in seconds since restoring from a checkpoint
* ``time_this_iter_s``: Runtime of the current training iteration in seconds (i.e.,
  one call to the trainable function or to ``_train()`` in the class API)
* ``time_total_s``: Total runtime in seconds
* ``timestamp``: Timestamp when the result was processed
* ``timesteps_since_restore``: Number of timesteps since restoring from a checkpoint
* ``timesteps_total``: Total number of timesteps
* ``training_iteration``: The number of times ``tune.report()`` has been called
* ``trial_id``: Unique trial ID

All of these metrics can be seen in the ``Trial.last_result`` dictionary, for example as sketched below.

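A minimal sketch of inspecting these fields after a run, assuming the function-API ``trainable`` defined above (``tune.run`` returns an analysis object whose trials expose ``last_result``):

.. code-block:: python

    analysis = tune.run(trainable, num_samples=2)

    # Print a few auto-filled fields for each trial.
    for trial in analysis.trials:
        result = trial.last_result
        print(trial.trial_id,
              result["training_iteration"],
              result["time_total_s"],
              result["hostname"])
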
.. _tune-checkpoint:

Checkpointing
-------------

When running a hyperparameter search, Tune can automatically and periodically save/checkpoint your model. This allows you to:

* save intermediate models throughout training
* use pre-emptible machines (by automatically restoring from the last checkpoint)
* pause trials when using Trial Schedulers such as HyperBand and PBT

To use Tune's checkpointing features, you must expose a ``checkpoint_dir`` argument in the function signature, and call ``tune.checkpoint_dir``:

.. code-block:: python

    import json
    import os
    import time

    from ray import tune

    def train_func(config, checkpoint_dir=None):
        start = 0
        if checkpoint_dir:
            with open(os.path.join(checkpoint_dir, "checkpoint")) as f:
                state = json.loads(f.read())
                start = state["step"] + 1

        for step in range(start, 100):
            time.sleep(1)

            # Obtain a checkpoint directory
            with tune.checkpoint_dir(step=step) as checkpoint_dir:
                path = os.path.join(checkpoint_dir, "checkpoint")
                with open(path, "w") as f:
                    f.write(json.dumps({"step": step}))

            tune.report(hello="world", ray="tune")

    tune.run(train_func)

In this example, checkpoints will be saved by training iteration to ``local_dir/exp_name/trial_name/checkpoint_<step>``.

You can restore a single trial checkpoint by using ``tune.run(restore=<checkpoint_dir>)``. By doing this, you can change aspects of the experiment configuration, such as the experiment's name:

.. code-block:: python

    # Restore the previous trial from the given checkpoint
    tune.run(
        "PG",
        name="RestoredExp", # The name can be different.
        stop={"training_iteration": 10}, # train 5 more iterations than previous
        restore="~/ray_results/Original/PG_<xxx>/checkpoint_5/checkpoint-5",
        config={"env": "CartPole-v0"},
    )

.. _tune-distributed-checkpointing:

Distributed Checkpointing
~~~~~~~~~~~~~~~~~~~~~~~~~

On a multi-node cluster, Tune automatically creates a copy of all trial checkpoints on the head node. This requires the Ray cluster to be started with the :ref:`cluster launcher <cluster-cloud>` and also requires rsync to be installed.

Note that you must use the ``tune.checkpoint_dir`` API to trigger syncing.

If you are running Ray Tune on Kubernetes, you should usually use a
:func:`DurableTrainable <ray.tune.durable>` or a shared filesystem for checkpoint sharing.
Please :ref:`see here for best practices for running Tune on Kubernetes <tune-kubernetes>`.

If you do not use the cluster launcher, you should set up an NFS or global file system and
disable cross-node syncing:

.. code-block:: python

    sync_config = tune.SyncConfig(sync_to_driver=False)
    tune.run(func, sync_config=sync_config)

Stopping and resuming a tuning run
----------------------------------

Ray Tune periodically checkpoints the experiment state so that it can be
restarted when it fails or stops. The checkpointing period is
dynamically adjusted so that at least 95% of the time is used for handling
training results and scheduling.

If you send a SIGINT signal to the process running ``tune.run()`` (which is
usually what happens when you press Ctrl+C in the console), Ray Tune shuts
down training gracefully and saves a final experiment-level checkpoint. You
can then call ``tune.run()`` with ``resume=True`` to continue this run in
the future:

.. code-block:: python
    :emphasize-lines: 14

    tune.run(
        train,
        # ...
        name="my_experiment"
    )

    # This is interrupted e.g. by sending a SIGINT signal

    # Next time, continue the run like so:
    tune.run(
        train,
        # ...
        name="my_experiment",
        resume=True
    )

You will have to pass a ``name`` if you are using ``resume=True`` so that
Ray Tune can detect the experiment folder (which is usually stored at e.g.
``~/ray_results/my_experiment``). If you forgot to pass a name in the first
call, you can still pass the name when you resume the run. Please note that
in this case it is likely that your experiment name has a date suffix, so if you
ran ``tune.run(my_trainable)``, the ``name`` might look something like this:
``my_trainable_2021-01-29_10-16-44``.

You can see which name you need to pass by taking a look at the results table
of your original tuning run:

.. code-block::
    :emphasize-lines: 5

    == Status ==
    Memory usage on this node: 11.0/16.0 GiB
    Using FIFO scheduling algorithm.
    Resources requested: 1/16 CPUs, 0/0 GPUs, 0.0/4.69 GiB heap, 0.0/1.61 GiB objects
    Result logdir: /Users/ray/ray_results/my_trainable_2021-01-29_10-16-44
    Number of trials: 1/1 (1 RUNNING)

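Continuing the example above, resuming that run would then look roughly like this (a minimal sketch using the dated name from the status output):

.. code-block:: python

    tune.run(
        my_trainable,
        # ...
        name="my_trainable_2021-01-29_10-16-44",
        resume=True
    )
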
Handling Large Datasets
-----------------------

You will often want to compute a large object (e.g., training data, model weights) on the driver and use that object within each trial.

Tune provides a wrapper function ``tune.with_parameters()`` that allows you to broadcast large objects to your trainable.
Objects passed with this wrapper will be stored on the Ray object store and will be automatically fetched
and passed to your trainable as a parameter.

.. code-block:: python

    from ray import tune

    import numpy as np

    def f(config, data=None):
        # use data
        pass

    data = np.random.random(size=100000000)

    tune.run(tune.with_parameters(f, data=data))

.. _tune-stopping:

Stopping Trials
---------------

You can control when trials are stopped early by passing the ``stop`` argument to ``tune.run``.
This argument takes either a dictionary, a function, or a :class:`Stopper <ray.tune.stopper.Stopper>` class
as an argument.

If a dictionary is passed in, the keys may be any field in the return result of ``tune.report`` in the Function API or ``step()`` (including the results from ``step`` and auto-filled metrics).

In the example below, each trial will be stopped either when it completes 10 iterations OR when it reaches a mean accuracy of 0.98. These metrics are assumed to be **increasing**.

.. code-block:: python

    # training_iteration is an auto-filled metric by Tune.
    tune.run(
        my_trainable,
        stop={"training_iteration": 10, "mean_accuracy": 0.98}
    )

For more flexibility, you can pass in a function instead. If a function is passed in, it must take ``(trial_id, result)`` as arguments and return a boolean (``True`` if the trial should be stopped and ``False`` otherwise).

.. code-block:: python

    def stopper(trial_id, result):
        return result["mean_accuracy"] / result["training_iteration"] > 5

    tune.run(my_trainable, stop=stopper)

Finally, you can implement the :class:`Stopper <ray.tune.stopper.Stopper>` abstract class for stopping entire experiments. For example, the following example stops all trials after the criterion is fulfilled by any individual trial, and prevents new ones from starting:

.. code-block:: python

    from ray.tune import Stopper

    class CustomStopper(Stopper):
        def __init__(self):
            self.should_stop = False

        def __call__(self, trial_id, result):
            if not self.should_stop and result['foo'] > 10:
                self.should_stop = True
            return self.should_stop

        def stop_all(self):
            """Returns whether to stop trials and prevent new ones from starting."""
            return self.should_stop

    stopper = CustomStopper()
    tune.run(my_trainable, stop=stopper)

Note that in the above example the currently running trials will not stop immediately but will do so once their current iterations are complete.

Ray Tune comes with a set of out-of-the-box stopper classes. See the :ref:`Stopper <tune-stoppers>` documentation.

.. _tune-logging:

Logging
-------

By default, Tune logs results in TensorBoard, CSV, and JSON formats. If you need to log something lower-level like model weights or gradients, see :ref:`Trainable Logging <trainable-logging>`.

**Learn more about logging and customizations here**: :ref:`loggers-docstring`.

Tune will log the results of each trial to a subfolder under a specified local dir, which defaults to ``~/ray_results``.

.. code-block:: python

    # This logs to 2 different trial folders:
    # ~/ray_results/trainable_name/trial_name_1 and ~/ray_results/trainable_name/trial_name_2
    # trainable_name and trial_name are autogenerated.
    tune.run(trainable, num_samples=2)

You can specify the ``local_dir`` and ``trainable_name``:

.. code-block:: python

    # This logs to 2 different trial folders:
    # ./results/test_experiment/trial_name_1 and ./results/test_experiment/trial_name_2
    # Only trial_name is autogenerated.
    tune.run(trainable, num_samples=2, local_dir="./results", name="test_experiment")

To specify custom trial folder names, you can pass the ``trial_name_creator`` argument
to ``tune.run``. This takes a function with the following signature:

.. code-block:: python

    def trial_name_string(trial):
        """
        Args:
            trial (Trial): A generated trial object.

        Returns:
            trial_name (str): String representation of Trial.
        """
        return str(trial)

    tune.run(
        MyTrainableClass,
        name="example-experiment",
        num_samples=1,
        trial_name_creator=trial_name_string
    )

See the documentation on Trials: :ref:`trial-docstring`.

.. _tensorboard:

TensorBoard (Logging)
---------------------

Tune automatically outputs TensorBoard files during ``tune.run``. To visualize learning in TensorBoard, install tensorboardX:

.. code-block:: bash

    $ pip install tensorboardX

Then, after you run an experiment, you can visualize your experiment with TensorBoard by specifying the output directory of your results.

.. code-block:: bash

    $ tensorboard --logdir=~/ray_results/my_experiment

If you are running Ray on a remote multi-user cluster where you do not have sudo access, you can run the following commands to make sure TensorBoard is able to write to the tmp directory:

.. code-block:: bash

    $ export TMPDIR=/tmp/$USER; mkdir -p $TMPDIR; tensorboard --logdir=~/ray_results

.. image:: ../ray-tune-tensorboard.png

If using TF2, Tune also automatically generates TensorBoard HParams output, as shown below:

.. code-block:: python

    tune.run(
        ...,
        config={
            "lr": tune.grid_search([1e-5, 1e-4]),
            "momentum": tune.grid_search([0, 0.9])
        }
    )

.. image:: ../images/tune-hparams.png

Console Output
--------------

User-provided fields will be output automatically on a best-effort basis. You can use a :ref:`Reporter <tune-reporter-doc>` object to customize the console output.

.. code-block:: bash

    == Status ==
    Memory usage on this node: 11.4/16.0 GiB
    Using FIFO scheduling algorithm.
    Resources requested: 4/12 CPUs, 0/0 GPUs, 0.0/3.17 GiB heap, 0.0/1.07 GiB objects
    Result logdir: /Users/foo/ray_results/myexp
    Number of trials: 4 (4 RUNNING)
    +----------------------+---------+---------------------+----------+--------+--------+----------------+------+
    | Trial name           | status  | loc                 |   param1 | param2 |    acc | total time (s) | iter |
    |----------------------+---------+---------------------+----------+--------+--------+----------------+------|
    | MyTrainable_a826033a | RUNNING | 10.234.98.164:31115 | 0.303706 | 0.0761 | 0.1289 |        7.54952 |   15 |
    | MyTrainable_a8263fc6 | RUNNING | 10.234.98.164:31117 | 0.929276 | 0.158  | 0.4865 |        7.0501  |   14 |
    | MyTrainable_a8267914 | RUNNING | 10.234.98.164:31111 | 0.068426 | 0.0319 | 0.9585 |        7.0477  |   14 |
    | MyTrainable_a826b7bc | RUNNING | 10.234.98.164:31112 | 0.729127 | 0.0748 | 0.1797 |        7.05715 |   14 |
    +----------------------+---------+---------------------+----------+--------+--------+----------------+------+

Uploading Results
-----------------

If an upload directory is provided, Tune will automatically sync results from the ``local_dir`` to the given directory, natively supporting standard S3/gsutil/HDFS URIs.

.. code-block:: python

    tune.run(
        MyTrainableClass,
        local_dir="~/ray_results",
        sync_config=tune.SyncConfig(upload_dir="s3://my-log-dir")
    )

You can customize this to specify arbitrary storages with the ``sync_to_cloud`` argument in ``tune.SyncConfig``. This argument supports either strings with the same replacement fields OR arbitrary functions.

.. code-block:: python

    tune.run(
        MyTrainableClass,
        sync_config=tune.SyncConfig(
            upload_dir="s3://my-log-dir",
            sync_to_cloud=custom_sync_str_or_func
        )
    )

If a string is provided, then it must include replacement fields ``{source}`` and ``{target}``, like ``s3 sync {source} {target}``. Alternatively, a function can be provided with the following signature:

.. code-block:: python

    import subprocess

    def custom_sync_func(source, target):
        # do arbitrary things inside
        sync_cmd = "s3 {source} {target}".format(
            source=source,
            target=target)
        sync_process = subprocess.Popen(sync_cmd, shell=True)
        sync_process.wait()

By default, syncing occurs every 300 seconds. To change the frequency of syncing, set the ``TUNE_CLOUD_SYNC_S`` environment variable for the driver process to the desired syncing period.

Note that uploading only happens when global experiment state is collected, and the frequency of this is determined by the ``TUNE_GLOBAL_CHECKPOINT_S`` environment variable. So the true upload period is given by ``max(TUNE_CLOUD_SYNC_S, TUNE_GLOBAL_CHECKPOINT_S)``.

.. _tune-docker:

Using Tune with Docker
----------------------

Tune automatically syncs files and checkpoints between different remote
containers as needed.

To make this work in your Docker cluster, e.g. when you are using the Ray autoscaler
with Docker containers, you will need to pass a
``DockerSyncer`` to the ``sync_to_driver`` argument of ``tune.SyncConfig``.

.. code-block:: python

    from ray.tune.integration.docker import DockerSyncer
    sync_config = tune.SyncConfig(
        sync_to_driver=DockerSyncer)

    tune.run(train, sync_config=sync_config)

.. _tune-kubernetes:

Using Tune with Kubernetes
--------------------------

Ray Tune automatically synchronizes files and checkpoints between different remote nodes as needed.
This usually happens via SSH, but this can be a :ref:`performance bottleneck <tune-bottlenecks>`,
especially when running many trials in parallel.

Instead, you should use shared storage for checkpoints so that no additional synchronization across nodes
is necessary. There are two main options.

First, you can use a :func:`DurableTrainable <ray.tune.durable>` to store your
logs and checkpoints on cloud storage, such as AWS S3 or Google Cloud Storage:

.. code-block:: python

    from ray import tune

    tune.run(
        tune.durable(train_fn),
        # ...,
        sync_config=tune.SyncConfig(
            sync_to_driver=False,
            upload_dir="s3://your-s3-bucket/durable-trial/"
        )
    )

Second, you can set up a shared file system like NFS. If you do this, disable automatic trial syncing:

.. code-block:: python

    from ray import tune

    tune.run(
        train_fn,
        # ...,
        local_dir="/path/to/shared/storage",
        sync_config=tune.SyncConfig(
            # Do not sync to driver because we are on shared storage
            sync_to_driver=False
        )
    )

Lastly, if you still want to use SSH for trial synchronization, but are not using
the Ray cluster launcher, you might need to pass a
``KubernetesSyncer`` to the ``sync_to_driver`` argument of ``tune.SyncConfig``.
You have to specify your Kubernetes namespace explicitly:

.. code-block:: python

    from ray.tune.integration.kubernetes import NamespacedKubernetesSyncer
    sync_config = tune.SyncConfig(
        sync_to_driver=NamespacedKubernetesSyncer("ray")
    )

    tune.run(train, sync_config=sync_config)

Please note that we strongly encourage you to use one of the other two options instead, as they will
result in less overhead and don't require pods to SSH into each other.

.. _tune-log_to_file:

Redirecting stdout and stderr to files
--------------------------------------

The stdout and stderr streams are usually printed to the console. For remote actors,
Ray collects these logs and prints them to the head process.

However, if you would like to collect the stream outputs in files for later
analysis or troubleshooting, Tune offers a utility parameter, ``log_to_file``,
for this.

By passing ``log_to_file=True`` to ``tune.run()``, stdout and stderr will be logged
to ``trial_logdir/stdout`` and ``trial_logdir/stderr``, respectively:

.. code-block:: python

    tune.run(
        trainable,
        log_to_file=True)

If you would like to specify the output files, you can either pass one filename,
where the combined output will be stored, or two filenames, for stdout and stderr,
respectively:

.. code-block:: python

    tune.run(
        trainable,
        log_to_file="std_combined.log")

    tune.run(
        trainable,
        log_to_file=("my_stdout.log", "my_stderr.log"))

The file names are relative to the trial's logdir. You can pass absolute paths,
too.

If ``log_to_file`` is set, Tune will automatically register a new logging handler
for Ray's base logger and log the output to the specified stderr output file.

.. _tune-callbacks:

Callbacks
---------

Ray Tune supports callbacks that are called at various times during the training process.
Callbacks can be passed as a parameter to ``tune.run()``, and the relevant sub-methods will be
invoked automatically.

This simple callback just prints a metric each time a result is received:

.. code-block:: python

    from ray import tune
    from ray.tune import Callback


    class MyCallback(Callback):
        def on_trial_result(self, iteration, trials, trial, result, **info):
            print(f"Got result: {result['metric']}")


    def train(config):
        for i in range(10):
            tune.report(metric=i)


    tune.run(
        train,
        callbacks=[MyCallback()])

For more details and available hooks, please :ref:`see the API docs for Ray Tune callbacks <tune-callbacks-docs>`.

.. _tune-debugging:

Debugging
---------

By default, Tune will run hyperparameter evaluations on multiple processes. However, if you need to debug your training process, it may be easier to do everything on a single process. You can force all Ray functions to occur on a single process with ``local_mode`` by calling the following before ``tune.run``.

.. code-block:: python

    ray.init(local_mode=True)

Local mode with multiple configuration evaluations will interleave computation, so it is most naturally used when running a single configuration evaluation.

Note that ``local_mode`` has some known issues, so please read :ref:`these tips <local-mode-tips>` for more info.

Stopping after the first failure
--------------------------------

By default, ``tune.run`` will continue executing until all trials have terminated or errored. To stop the entire Tune run as soon as **any** trial errors:

.. code-block:: python

    tune.run(trainable, fail_fast=True)

This is useful when you are trying to set up a large hyperparameter experiment.

Environment variables
---------------------

Some of Ray Tune's behavior can be configured using environment variables.
These are the environment variables Ray Tune currently considers (a short usage sketch follows the list):

* **TUNE_CLUSTER_SSH_KEY**: SSH key used by the Tune driver process to connect
  to remote cluster machines for checkpoint syncing. If this is not set,
  ``~/ray_bootstrap_key.pem`` will be used.
* **TUNE_DISABLE_AUTO_CALLBACK_LOGGERS**: Ray Tune automatically adds a CSV and
  JSON logger callback if they haven't been passed. Setting this variable to
  ``1`` disables this automatic creation. Please note that this will most likely
  affect analyzing your results after the tuning run.
* **TUNE_DISABLE_AUTO_CALLBACK_SYNCER**: Ray Tune automatically adds a
  Syncer callback to sync logs and checkpoints between different nodes if none
  has been passed. Setting this variable to ``1`` disables this automatic creation.
  Please note that this will most likely affect advanced scheduling algorithms
  like PopulationBasedTraining.
* **TUNE_DISABLE_AUTO_INIT**: Disable automatically calling ``ray.init()`` if
  not attached to a Ray session.
* **TUNE_DISABLE_DATED_SUBDIR**: Ray Tune automatically adds a date string to experiment
  directories when the name is not specified explicitly or the trainable isn't passed
  as a string. Setting this environment variable to ``1`` disables adding these date strings.
* **TUNE_DISABLE_STRICT_METRIC_CHECKING**: When you report metrics to Tune via
  ``tune.report()`` and have passed a ``metric`` parameter to ``tune.run()``, a scheduler,
  or a search algorithm, Tune will error
  if the metric was not reported in the result. Setting this environment variable
  to ``1`` will disable this check.
* **TUNE_DISABLE_SIGINT_HANDLER**: Ray Tune catches SIGINT signals (e.g. sent by
  Ctrl+C) to shut down gracefully and do a final checkpoint. Setting this variable
  to ``1`` will disable signal handling and stop execution right away. Defaults to
  ``0``.
* **TUNE_FUNCTION_THREAD_TIMEOUT_S**: Time in seconds the function API waits
  for threads to finish after instructing them to complete. Defaults to ``2``.
* **TUNE_GLOBAL_CHECKPOINT_S**: Time in seconds that limits how often Tune's
  experiment state is checkpointed. If not set this will default to ``10``.
* **TUNE_MAX_LEN_IDENTIFIER**: Maximum length of trial subdirectory names (those
  with the parameter values in them).
* **TUNE_MAX_PENDING_TRIALS_PG**: Maximum number of pending trials when placement groups are used. Defaults
  to ``auto``, which will be updated to ``1000`` for random/grid search and ``1`` for any other search algorithms.
* **TUNE_PLACEMENT_GROUP_AUTO_DISABLED**: Ray Tune automatically uses placement groups
  instead of the legacy resource requests. Setting this to ``1`` enables legacy placement.
* **TUNE_PLACEMENT_GROUP_CLEANUP_DISABLED**: Ray Tune cleans up existing placement groups
  with the ``_tune__`` prefix in their name before starting a run. This is used to make sure
  that scheduled placement groups are removed when multiple calls to ``tune.run()`` are
  done in the same script. You might want to disable this if you run multiple Tune runs in
  parallel from different scripts. Set to ``1`` to disable.
* **TUNE_PLACEMENT_GROUP_PREFIX**: Prefix for placement groups created by Ray Tune. This prefix is used
  e.g. to identify placement groups that should be cleaned up on start/stop of the tuning run. This is
  initialized to a unique name at the start of the first run.
* **TUNE_PLACEMENT_GROUP_RECON_INTERVAL**: How often to reconcile placement groups. Reconciliation is
  used to make sure that the number of requested placement groups and pending/running trials are in sync.
  In normal circumstances these shouldn't differ anyway, but reconciliation makes sure to capture cases when
  placement groups are manually destroyed. Reconciliation doesn't take much time, but it can add up when
  running a large number of short trials. Defaults to every ``5`` (seconds).
* **TUNE_PLACEMENT_GROUP_WAIT_S**: Default time the trial executor waits for placement
  groups to be placed before continuing the tuning loop. Setting this to a float
  will block for that many seconds. This is mostly used for testing purposes. Defaults
  to ``-1``, which disables blocking.
* **TUNE_RESULT_DIR**: Directory where Ray Tune trial results are stored. If this
  is not set, ``~/ray_results`` will be used.
* **TUNE_RESULT_BUFFER_LENGTH**: Ray Tune can buffer results from trainables before they are passed
  to the driver. Enabling this might delay scheduling decisions, as trainables are speculatively
  continued. Setting this to ``0`` disables result buffering. Defaults to ``1000`` (results).
* **TUNE_RESULT_BUFFER_MAX_TIME_S**: Similarly, Ray Tune buffers results up to ``number_of_trial/10`` seconds,
  but never longer than this value. Defaults to ``100`` (seconds).
* **TUNE_RESULT_BUFFER_MIN_TIME_S**: Additionally, you can specify a minimum time to buffer results. Defaults to ``0``.
* **TUNE_SYNCER_VERBOSITY**: Amount of command output when using Tune with Docker Syncer. Defaults to ``0``.
* **TUNE_TRIAL_STARTUP_GRACE_PERIOD**: Amount of time after starting a trial that Ray Tune checks for successful
  trial startups. After the grace period, Tune will block until a result from a running trial is received. Can
  be disabled by setting this to less than or equal to ``0``.
* **TUNE_WARN_THRESHOLD_S**: Threshold for logging if a Tune event loop operation takes too long. Defaults to ``0.5`` (seconds).
* **TUNE_STATE_REFRESH_PERIOD**: Frequency of updating the resource tracking from Ray. Defaults to ``10`` (seconds).

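As a minimal, hedged sketch, these variables can usually be set in the driver process before Tune is started, for example via ``os.environ`` (the specific variables and values below are arbitrary illustrations):

.. code-block:: python

    import os

    # Set before tune.run() so the configuration is picked up.
    os.environ["TUNE_RESULT_DIR"] = os.path.expanduser("~/my_tune_results")
    os.environ["TUNE_GLOBAL_CHECKPOINT_S"] = "60"  # checkpoint experiment state at most once every 60s
    os.environ["TUNE_DISABLE_STRICT_METRIC_CHECKING"] = "1"

    from ray import tune

    tune.run(trainable)  # ``trainable`` as defined earlier
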
There are some environment variables that are mostly relevant for integrated libraries:

* **SIGOPT_KEY**: SigOpt API access key.
* **WANDB_API_KEY**: Weights and Biases API key. You can also use ``wandb login``
  instead.

Further Questions or Issues?
----------------------------

.. include:: /_help.rst