.. _train-config:

Ray Train Configuration User Guide
==================================

This guide covers how to configure scale-out, run options, and fault tolerance for Ray Train.
For more details on how to configure data ingest, also see :ref:`air-ingest`.

Scaling Configurations in Train (``ScalingConfig``)
---------------------------------------------------

The scaling configuration specifies distributed training properties like the number of workers or the
resources per worker.

The properties of the scaling configuration are :ref:`tunable <air-tuner-search-space>`.

.. literalinclude:: doc_code/key_concepts.py
    :language: python
    :start-after: __scaling_config_start__
    :end-before: __scaling_config_end__
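
As a quick illustration, a scaling configuration for data-parallel GPU training might look like the
sketch below. The worker count and resource amounts are placeholder values, not recommendations.

.. code-block:: python

    from ray.air import ScalingConfig

    # Sketch of a typical scaling configuration; all numbers are illustrative.
    scaling_config = ScalingConfig(
        # Number of distributed training workers to launch.
        num_workers=4,
        # Whether each worker should reserve a GPU.
        use_gpu=True,
        # Resources reserved for each individual training worker.
        resources_per_worker={"CPU": 2, "GPU": 1},
    )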

.. seealso::

    See the :class:`~ray.air.ScalingConfig` API reference.

.. _train-run-config:

Run Configuration in Train (``RunConfig``)
------------------------------------------

``RunConfig`` is a configuration object used in Ray Train to define the experiment
spec that corresponds to a call to ``trainer.fit()``.

It includes settings such as the experiment name, storage path for results,
stopping conditions, custom callbacks, checkpoint configuration, verbosity level,
and logging options.

Many of these settings are configured through other config objects and passed through
the ``RunConfig``. The following sub-sections contain descriptions of these configs.

The properties of the run configuration are :ref:`not tunable <air-tuner-search-space>`.

.. literalinclude:: doc_code/key_concepts.py
    :language: python
    :start-after: __run_config_start__
    :end-before: __run_config_end__
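
For illustration, a run configuration that names the experiment and persists results to shared
storage could look like the sketch below. The experiment name and ``storage_path`` are placeholders.

.. code-block:: python

    from ray.air import RunConfig

    # Sketch only; replace the name and storage path with your own values.
    run_config = RunConfig(
        # Name of the experiment; results are stored under this directory.
        name="my_train_experiment",
        # Where to persist results and checkpoints (local path or cloud URI).
        storage_path="s3://my-bucket/ray-results",
    )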

.. seealso::

    See the :class:`~ray.air.RunConfig` API reference.

    See :ref:`tune-storage-options` for storage configuration examples (related to ``storage_path``).

Failure configurations in Train (``FailureConfig``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The failure configuration specifies how failures during training should be handled.

As part of the ``RunConfig``, the properties of the failure configuration
are :ref:`not tunable <air-tuner-search-space>`.

.. literalinclude:: doc_code/key_concepts.py
    :language: python
    :start-after: __failure_config_start__
    :end-before: __failure_config_end__
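
As a sketch, retrying a run a bounded number of times could be configured as follows. The retry
count is an arbitrary placeholder.

.. code-block:: python

    from ray.air import FailureConfig, RunConfig

    # Sketch: allow the run to be restarted up to 3 times on failure.
    # 0 (the default) disables retries; -1 retries indefinitely.
    run_config = RunConfig(
        failure_config=FailureConfig(max_failures=3),
    )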

.. seealso::

    See the :class:`~ray.air.FailureConfig` API reference.

Checkpoint configurations in Train (``CheckpointConfig``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The checkpoint configuration specifies how often to checkpoint training state
and how many checkpoints to keep.

As part of the ``RunConfig``, the properties of the checkpoint configuration
are :ref:`not tunable <air-tuner-search-space>`.

.. literalinclude:: doc_code/key_concepts.py
    :language: python
    :start-after: __checkpoint_config_start__
    :end-before: __checkpoint_config_end__
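
For example, keeping only the best few checkpoints ranked by a reported metric could look like the
sketch below. The metric name is a placeholder and must match a metric that your training loop reports.

.. code-block:: python

    from ray.air import CheckpointConfig, RunConfig

    # Sketch: keep the 2 best checkpoints, ranked by a reported metric.
    # "loss" is a placeholder metric name.
    run_config = RunConfig(
        checkpoint_config=CheckpointConfig(
            num_to_keep=2,
            checkpoint_score_attribute="loss",
            checkpoint_score_order="min",
        ),
    )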

Trainers of certain frameworks, including :class:`~ray.train.xgboost.XGBoostTrainer`,
:class:`~ray.train.lightgbm.LightGBMTrainer`, and :class:`~ray.train.huggingface.TransformersTrainer`,
implement checkpointing out of the box. For these trainers, checkpointing can be
enabled by setting the checkpoint frequency within the :class:`~ray.air.CheckpointConfig`.

.. literalinclude:: doc_code/key_concepts.py
    :language: python
    :start-after: __checkpoint_config_ckpt_freq_start__
    :end-before: __checkpoint_config_ckpt_freq_end__
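
A minimal sketch of what this could look like for one of these trainers; the frequency value is arbitrary.

.. code-block:: python

    from ray.air import CheckpointConfig, RunConfig

    # Sketch: for trainers with built-in checkpointing (e.g. XGBoostTrainer),
    # save a checkpoint every 10 iterations and once more at the end of training.
    run_config = RunConfig(
        checkpoint_config=CheckpointConfig(
            checkpoint_frequency=10,
            checkpoint_at_end=True,
        ),
    )
    # Pass ``run_config`` to the trainer, for example:
    # XGBoostTrainer(..., run_config=run_config)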

.. warning::

    ``checkpoint_frequency`` and other parameters do *not* work for trainers
    that accept a custom training loop such as :class:`~ray.train.torch.TorchTrainer`,
    since checkpointing is fully user-controlled.

.. seealso::

    See the :class:`~ray.air.CheckpointConfig` API reference.

**[Experimental] Distributed Checkpoints**: For model parallel workloads where the models do not fit in a single GPU worker,
it is important to save and upload the model that is partitioned across different workers. You
can enable this by setting ``_checkpoint_keep_all_ranks=True`` to retain the model checkpoints across workers,
and ``_checkpoint_upload_from_workers=True`` to upload their checkpoints to cloud storage directly, both in :class:`~ray.air.CheckpointConfig`.
This functionality works for any trainer that inherits from :class:`~ray.train.data_parallel_trainer.DataParallelTrainer`.
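
A sketch of enabling this experimental behavior; both flags are private/experimental and may change,
and the storage path is a placeholder.

.. code-block:: python

    from ray.air import CheckpointConfig, RunConfig

    # Experimental sketch: keep per-worker checkpoint shards and upload them
    # to cloud storage directly from the workers.
    run_config = RunConfig(
        storage_path="s3://my-bucket/ray-results",  # placeholder cloud URI
        checkpoint_config=CheckpointConfig(
            _checkpoint_keep_all_ranks=True,
            _checkpoint_upload_from_workers=True,
        ),
    )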

Synchronization configurations in Train (``tune.SyncConfig``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``tune.SyncConfig`` specifies how synchronization of results
and checkpoints should happen in a distributed Ray cluster.

As part of the ``RunConfig``, the properties of the synchronization configuration
are :ref:`not tunable <air-tuner-search-space>`.

.. note::

    This configuration is mostly relevant to running multiple Train runs with
    Ray Tune. See :ref:`tune-storage-options` for a guide on using the ``SyncConfig``.
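
As a sketch, the sync configuration is passed through the run configuration; the sync period below is
a placeholder value in seconds.

.. code-block:: python

    from ray import tune
    from ray.air import RunConfig

    # Sketch: sync results and checkpoints every 300 seconds (placeholder).
    run_config = RunConfig(
        sync_config=tune.SyncConfig(sync_period=300),
    )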

.. seealso::

    See the :class:`~ray.tune.syncer.SyncConfig` API reference.