
.. _train-config:

Ray Train Configuration User Guide
==================================

This guide gives an overview of how to configure scale-out, run options, and fault tolerance for Ray Train.
For details on how to configure data ingest, also refer to :ref:`air-ingest`.

Scaling Configurations in Train (``ScalingConfig``)
---------------------------------------------------

The scaling configuration specifies distributed training properties like the number of workers or the
resources per worker.

The properties of the scaling configuration are :ref:`tunable <air-tuner-search-space>`.

.. literalinclude:: doc_code/key_concepts.py
    :language: python
    :start-after: __scaling_config_start__
    :end-before: __scaling_config_end__
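
For example, a scaling configuration that requests two GPU workers might look like the
following sketch (the exact worker count and resource values are illustrative):

.. code-block:: python

    from ray.air import ScalingConfig

    # Request 2 distributed training workers, each with 1 GPU and 4 CPUs.
    scaling_config = ScalingConfig(
        num_workers=2,
        use_gpu=True,
        resources_per_worker={"CPU": 4, "GPU": 1},
    )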

.. seealso::

    See the :class:`~ray.air.ScalingConfig` API reference.

.. _train-run-config:

Run Configuration in Train (``RunConfig``)
------------------------------------------

``RunConfig`` is a configuration object used in Ray Train to define the experiment
spec that corresponds to a call to ``trainer.fit()``.
It includes settings such as the experiment name, storage path for results,
stopping conditions, custom callbacks, checkpoint configuration, verbosity level,
and logging options.

Many of these settings are configured through other config objects and passed through
the ``RunConfig``. The following sub-sections contain descriptions of these configs.

The properties of the run configuration are :ref:`not tunable <air-tuner-search-space>`.

.. literalinclude:: doc_code/key_concepts.py
    :language: python
    :start-after: __run_config_start__
    :end-before: __run_config_end__
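
As a sketch, a run configuration that names the experiment and writes results to shared
storage might look like this (the name and storage path below are placeholders):

.. code-block:: python

    from ray.air import RunConfig

    run_config = RunConfig(
        # Placeholder experiment name; results are grouped under this name.
        name="my_train_experiment",
        # Placeholder storage location; can be a local path or a cloud URI.
        storage_path="s3://my-bucket/train-results",
    )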

.. seealso::

    See the :class:`~ray.air.RunConfig` API reference.

    See :ref:`tune-storage-options` for storage configuration examples (related to ``storage_path``).

Failure configurations in Train (``FailureConfig``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The failure configuration specifies how training failures should be handled.

As part of the ``RunConfig``, the properties of the failure configuration
are :ref:`not tunable <air-tuner-search-space>`.

.. literalinclude:: doc_code/key_concepts.py
    :language: python
    :start-after: __failure_config_start__
    :end-before: __failure_config_end__
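
For example, to let a run recover from failures a few times before giving up, a failure
configuration along these lines can be passed through the ``RunConfig`` (the retry count
is illustrative):

.. code-block:: python

    from ray.air import FailureConfig, RunConfig

    run_config = RunConfig(
        # Retry the run up to 3 times on failure; -1 would retry indefinitely.
        failure_config=FailureConfig(max_failures=3),
    )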

.. seealso::

    See the :class:`~ray.air.FailureConfig` API reference.

Checkpoint configurations in Train (``CheckpointConfig``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The checkpoint configuration specifies how often to checkpoint training state
and how many checkpoints to keep.

As part of the ``RunConfig``, the properties of the checkpoint configuration
are :ref:`not tunable <air-tuner-search-space>`.

.. literalinclude:: doc_code/key_concepts.py
    :language: python
    :start-after: __checkpoint_config_start__
    :end-before: __checkpoint_config_end__
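
As an illustrative sketch, keeping only the two best checkpoints ranked by a reported
``loss`` metric could be configured as follows (the metric name is a placeholder for
whatever your training loop reports):

.. code-block:: python

    from ray.air import CheckpointConfig, RunConfig

    run_config = RunConfig(
        checkpoint_config=CheckpointConfig(
            # Keep only the 2 best checkpoints on disk.
            num_to_keep=2,
            # Rank checkpoints by the reported "loss" metric, lower is better.
            checkpoint_score_attribute="loss",
            checkpoint_score_order="min",
        ),
    )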

Trainers of certain frameworks including :class:`~ray.train.xgboost.XGBoostTrainer`,
:class:`~ray.train.lightgbm.LightGBMTrainer`, and :class:`~ray.train.huggingface.TransformersTrainer`
implement checkpointing out of the box. For these trainers, checkpointing can be
enabled by setting the checkpoint frequency within the :class:`~ray.air.CheckpointConfig`.

.. literalinclude:: doc_code/key_concepts.py
    :language: python
    :start-after: __checkpoint_config_ckpt_freq_start__
    :end-before: __checkpoint_config_ckpt_freq_end__
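
For example, a sketch of a checkpoint configuration that saves a checkpoint every 10
training iterations and always keeps one at the end of training, which would then be
passed to one of the framework trainers above (the frequency value is illustrative):

.. code-block:: python

    from ray.air import CheckpointConfig, RunConfig

    run_config = RunConfig(
        checkpoint_config=CheckpointConfig(
            # Save a checkpoint every 10 training iterations (framework trainers only).
            checkpoint_frequency=10,
            # Always save a final checkpoint at the end of training.
            checkpoint_at_end=True,
        ),
    )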

.. warning::

    ``checkpoint_frequency`` and other parameters do *not* work for trainers
    that accept a custom training loop such as :class:`~ray.train.torch.TorchTrainer`,
    since checkpointing is fully user-controlled.

.. seealso::

    See the :class:`~ray.air.CheckpointConfig` API reference.

**[Experimental] Distributed Checkpoints**: For model parallel workloads where the model does not fit on a single GPU worker,
it is important to save and upload the model shards that are partitioned across the different workers. You
can enable this in :class:`~ray.air.CheckpointConfig` by setting ``_checkpoint_keep_all_ranks=True`` to retain the model checkpoints from all workers,
and ``_checkpoint_upload_from_workers=True`` to have each worker upload its checkpoint directly to cloud storage.
This functionality works for any trainer that inherits from :class:`~ray.train.data_parallel_trainer.DataParallelTrainer`.
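
For illustration, enabling both experimental flags might look like the following sketch.
These leading-underscore parameters are experimental and may change between releases:

.. code-block:: python

    from ray.air import CheckpointConfig, RunConfig

    run_config = RunConfig(
        checkpoint_config=CheckpointConfig(
            # Experimental: keep the checkpoint shards written by every worker rank.
            _checkpoint_keep_all_ranks=True,
            # Experimental: let each worker upload its shard directly to cloud storage.
            _checkpoint_upload_from_workers=True,
        ),
    )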

Synchronization configurations in Train (``tune.SyncConfig``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``tune.SyncConfig`` specifies how synchronization of results
and checkpoints should happen in a distributed Ray cluster.

As part of the ``RunConfig``, the properties of the sync configuration
are :ref:`not tunable <air-tuner-search-space>`.

.. note::

    This configuration is mostly relevant when running multiple Train runs with
    Ray Tune. See :ref:`tune-storage-options` for a guide on using the ``SyncConfig``.
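
As a minimal sketch, adjusting only how often results and checkpoints are synced could
look like the following (the period value is illustrative):

.. code-block:: python

    from ray import tune
    from ray.air import RunConfig

    run_config = RunConfig(
        # Sync results and checkpoints every 300 seconds instead of the default cadence.
        sync_config=tune.SyncConfig(sync_period=300),
    )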

.. seealso::

    See the :class:`~ray.tune.syncer.SyncConfig` API reference.