.. _train-getting-started:

Getting Started with Distributed Model Training in Ray Train
============================================================

Ray Train offers multiple ``Trainers`` that implement scalable model training for different machine learning frameworks.
Here are examples for some of the commonly used trainers:

.. tab-set::

    .. tab-item:: XGBoost

        In this example we will train a model using distributed XGBoost.

        First, we load the dataset from S3 using Ray Data and split it into a
        train and validation dataset.

        .. literalinclude:: doc_code/gbdt_user_guide.py
            :language: python
            :start-after: __xgb_detail_intro_start__
            :end-before: __xgb_detail_intro_end__
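
        For reference, here is a minimal sketch of what this step can look
        like. The dataset path and the split ratio are illustrative; the exact
        code is in the included snippet above.

        .. code-block:: python

            import ray

            # Load a CSV dataset from S3 into a Ray Dataset.
            # The path below is illustrative.
            dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")

            # Split it into train and validation datasets.
            train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)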

        In the :class:`ScalingConfig <ray.air.config.ScalingConfig>`,
        we configure the number of workers to use:

        .. literalinclude:: doc_code/gbdt_user_guide.py
            :language: python
            :start-after: __xgb_detail_scaling_start__
            :end-before: __xgb_detail_scaling_end__
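
        A minimal sketch of this configuration, assuming two CPU workers:

        .. code-block:: python

            from ray.air.config import ScalingConfig

            scaling_config = ScalingConfig(
                # Number of distributed workers.
                num_workers=2,
                # Set to True to use GPU acceleration.
                use_gpu=False,
            )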

        We then instantiate our ``XGBoostTrainer`` by passing in:

        - The aforementioned ``ScalingConfig``.
        - The ``label_column``, which is the name of the column in the Dataset that contains the labels.
        - The ``params``, which are `XGBoost training parameters <https://xgboost.readthedocs.io/en/stable/parameter.html>`__.

        .. literalinclude:: doc_code/gbdt_user_guide.py
            :language: python
            :start-after: __xgb_detail_training_start__
            :end-before: __xgb_detail_training_end__
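
        A sketch of the instantiation, assuming the objects from the previous
        sketches; the label column name and the parameter values are
        illustrative:

        .. code-block:: python

            from ray.train.xgboost import XGBoostTrainer

            trainer = XGBoostTrainer(
                scaling_config=scaling_config,
                label_column="target",  # illustrative column name
                params={
                    # XGBoost-specific training parameters.
                    "objective": "binary:logistic",
                    "eval_metric": ["logloss", "error"],
                },
                datasets={"train": train_dataset, "valid": valid_dataset},
            )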

        Lastly, we call ``trainer.fit()`` to kick off training and obtain the results.

        .. literalinclude:: doc_code/gbdt_user_guide.py
            :language: python
            :start-after: __xgb_detail_fit_start__
            :end-before: __xgb_detail_fit_end__
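
        For reference, a sketch of this final step:

        .. code-block:: python

            # Run distributed training and block until it completes.
            result = trainer.fit()

            # Reported metrics, such as the final evaluation scores.
            print(result.metrics)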

    .. tab-item:: LightGBM

        In this example we will train a model using distributed LightGBM.

        First, we load the dataset from S3 using Ray Data and split it into a
        train and validation dataset.

        .. literalinclude:: doc_code/gbdt_user_guide.py
            :language: python
            :start-after: __lgbm_detail_intro_start__
            :end-before: __lgbm_detail_intro_end__

        In the :class:`ScalingConfig <ray.air.config.ScalingConfig>`,
        we configure the number of workers to use:

        .. literalinclude:: doc_code/gbdt_user_guide.py
            :language: python
            :start-after: __lgbm_detail_scaling_start__
            :end-before: __lgbm_detail_scaling_end__

        We then instantiate our ``LightGBMTrainer`` by passing in:

        - The aforementioned ``ScalingConfig``.
        - The ``label_column``, which is the name of the column in the Dataset that contains the labels.
        - The ``params``, which are core `LightGBM training parameters <https://lightgbm.readthedocs.io/en/latest/Parameters.html>`__.

        .. literalinclude:: doc_code/gbdt_user_guide.py
            :language: python
            :start-after: __lgbm_detail_training_start__
            :end-before: __lgbm_detail_training_end__
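
        The instantiation mirrors the XGBoost example; a sketch with
        illustrative column name and parameters, assuming the objects from the
        previous steps:

        .. code-block:: python

            from ray.train.lightgbm import LightGBMTrainer

            trainer = LightGBMTrainer(
                scaling_config=scaling_config,
                label_column="target",  # illustrative column name
                params={
                    # Core LightGBM training parameters.
                    "objective": "binary",
                    "metric": ["binary_logloss", "binary_error"],
                },
                datasets={"train": train_dataset, "valid": valid_dataset},
            )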

        Lastly, we call ``trainer.fit()`` to kick off training and obtain the results.

        .. literalinclude:: doc_code/gbdt_user_guide.py
            :language: python
            :start-after: __lgbm_detail_fit_start__
            :end-before: __lgbm_detail_fit_end__

    .. tab-item:: PyTorch

        This example shows how you can use Ray Train with PyTorch.

        First, set up your dataset and model.

        .. literalinclude:: /../../python/ray/train/examples/pytorch/torch_quick_start.py
            :language: python
            :start-after: __torch_setup_begin__
            :end-before: __torch_setup_end__
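
        A minimal sketch of such a setup, using a small randomly generated
        regression dataset and a two-layer network; the sizes and names are
        illustrative:

        .. code-block:: python

            import torch
            import torch.nn as nn

            num_samples = 20
            input_size = 10
            layer_size = 15
            output_size = 5

            class NeuralNetwork(nn.Module):
                def __init__(self):
                    super().__init__()
                    self.layer1 = nn.Linear(input_size, layer_size)
                    self.relu = nn.ReLU()
                    self.layer2 = nn.Linear(layer_size, output_size)

                def forward(self, x):
                    return self.layer2(self.relu(self.layer1(x)))

            # Random inputs and labels stand in for a real dataset.
            inputs = torch.randn(num_samples, input_size)
            labels = torch.randn(num_samples, output_size)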

        Now define your single-worker PyTorch training function.

        .. literalinclude:: /../../python/ray/train/examples/pytorch/torch_quick_start.py
            :language: python
            :start-after: __torch_single_begin__
            :end-before: __torch_single_end__
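
        For instance, a plain training loop over the model and data sketched
        in the setup step:

        .. code-block:: python

            def train_func():
                num_epochs = 3
                model = NeuralNetwork()
                loss_fn = nn.MSELoss()
                optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

                for epoch in range(num_epochs):
                    output = model(inputs)
                    loss = loss_fn(output, labels)
                    optimizer.zero_grad()
                    loss.backward()
                    optimizer.step()
                    print(f"epoch: {epoch}, loss: {loss.item()}")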

        This training function can be executed with:

        .. literalinclude:: /../../python/ray/train/examples/pytorch/torch_quick_start.py
            :language: python
            :start-after: __torch_single_run_begin__
            :end-before: __torch_single_run_end__
            :dedent:

        Now let's convert this to a distributed multi-worker training function!

        All you have to do is use the ``ray.train.torch.prepare_model`` and
        ``ray.train.torch.prepare_data_loader`` utility functions to
        set up your model and data for distributed training.
        This automatically wraps your model with ``DistributedDataParallel``,
        places it on the right device, and adds ``DistributedSampler`` to your DataLoaders.

        .. literalinclude:: /../../python/ray/train/examples/pytorch/torch_quick_start.py
            :language: python
            :start-after: __torch_distributed_begin__
            :end-before: __torch_distributed_end__
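
        A sketch of the distributed version, assuming the model and data from
        the setup step; the batch size is illustrative:

        .. code-block:: python

            from torch.utils.data import DataLoader, TensorDataset

            import ray.train.torch

            def train_func_distributed():
                num_epochs = 3
                model = NeuralNetwork()
                # Wraps the model with DistributedDataParallel and moves it
                # to the right device.
                model = ray.train.torch.prepare_model(model)
                loss_fn = nn.MSELoss()
                optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

                dataloader = DataLoader(TensorDataset(inputs, labels), batch_size=4)
                # Adds a DistributedSampler and moves batches to the right device.
                dataloader = ray.train.torch.prepare_data_loader(dataloader)

                for epoch in range(num_epochs):
                    for batch_inputs, batch_labels in dataloader:
                        output = model(batch_inputs)
                        loss = loss_fn(output, batch_labels)
                        optimizer.zero_grad()
                        loss.backward()
                        optimizer.step()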

        Then, instantiate a ``TorchTrainer``
        with 4 workers, and use it to run the new training function!

        .. literalinclude:: /../../python/ray/train/examples/pytorch/torch_quick_start.py
            :language: python
            :start-after: __torch_trainer_begin__
            :end-before: __torch_trainer_end__
            :dedent:
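
        A sketch of this step, assuming the distributed training function
        defined above:

        .. code-block:: python

            from ray.air.config import ScalingConfig
            from ray.train.torch import TorchTrainer

            trainer = TorchTrainer(
                train_loop_per_worker=train_func_distributed,
                scaling_config=ScalingConfig(num_workers=4, use_gpu=False),
            )
            results = trainer.fit()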

        See :ref:`train-porting-code` for a more comprehensive example.

    .. tab-item:: TensorFlow

        This example shows how you can use Ray Train to set up `Multi-worker training
        with Keras <https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras>`_.

        First, set up your dataset and model.

        .. literalinclude:: /../../python/ray/train/examples/tf/tensorflow_quick_start.py
            :language: python
            :start-after: __tf_setup_begin__
            :end-before: __tf_setup_end__
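
        A minimal sketch of such a setup, using the Keras MNIST dataset and a
        small CNN; the architecture and hyperparameters are illustrative:

        .. code-block:: python

            import numpy as np
            import tensorflow as tf

            def mnist_dataset(batch_size):
                (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
                # Normalize pixel values and cast labels for training.
                x_train = x_train / np.float32(255)
                y_train = y_train.astype(np.int64)
                return (
                    tf.data.Dataset.from_tensor_slices((x_train, y_train))
                    .shuffle(60000)
                    .repeat()
                    .batch(batch_size)
                )

            def build_and_compile_cnn_model():
                model = tf.keras.Sequential([
                    tf.keras.layers.InputLayer(input_shape=(28, 28)),
                    tf.keras.layers.Reshape(target_shape=(28, 28, 1)),
                    tf.keras.layers.Conv2D(32, 3, activation="relu"),
                    tf.keras.layers.Flatten(),
                    tf.keras.layers.Dense(128, activation="relu"),
                    tf.keras.layers.Dense(10),
                ])
                model.compile(
                    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                    optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
                    metrics=["accuracy"],
                )
                return model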

        Now define your single-worker TensorFlow training function.

        .. literalinclude:: /../../python/ray/train/examples/tf/tensorflow_quick_start.py
            :language: python
            :start-after: __tf_single_begin__
            :end-before: __tf_single_end__
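
        For instance, a sketch assuming the helpers from the setup step; the
        batch size and epoch counts are illustrative:

        .. code-block:: python

            def train_func():
                batch_size = 64
                single_worker_dataset = mnist_dataset(batch_size)
                single_worker_model = build_and_compile_cnn_model()
                single_worker_model.fit(
                    single_worker_dataset, epochs=3, steps_per_epoch=70
                )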

        This training function can be executed with:

        .. literalinclude:: /../../python/ray/train/examples/tf/tensorflow_quick_start.py
            :language: python
            :start-after: __tf_single_run_begin__
            :end-before: __tf_single_run_end__
            :dedent:

        Now let's convert this to a distributed multi-worker training function!

        All you need to do is:

        1. Set the per-worker batch size: each worker processes the same size
           batch as in the single-worker code.
        2. Choose your TensorFlow distributed training strategy. In this example,
           we use the ``MultiWorkerMirroredStrategy``.

        .. literalinclude:: /../../python/ray/train/examples/tf/tensorflow_quick_start.py
            :language: python
            :start-after: __tf_distributed_begin__
            :end-before: __tf_distributed_end__
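
        A sketch of the distributed version, assuming the helpers from the
        setup step. Ray Train sets the ``TF_CONFIG`` environment variable on
        each worker, which we read here to scale the global batch size:

        .. code-block:: python

            import json
            import os

            def train_func_distributed():
                per_worker_batch_size = 64
                # Ray Train sets TF_CONFIG on each worker; use it to find
                # the total number of workers.
                tf_config = json.loads(os.environ["TF_CONFIG"])
                num_workers = len(tf_config["cluster"]["worker"])

                strategy = tf.distribute.MultiWorkerMirroredStrategy()

                global_batch_size = per_worker_batch_size * num_workers
                multi_worker_dataset = mnist_dataset(global_batch_size)

                # Model building and compiling must happen within strategy.scope().
                with strategy.scope():
                    multi_worker_model = build_and_compile_cnn_model()

                multi_worker_model.fit(
                    multi_worker_dataset, epochs=3, steps_per_epoch=70
                )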

        Then, instantiate a ``TensorflowTrainer`` with 4 workers,
        and use it to run the new training function!

        .. literalinclude:: /../../python/ray/train/examples/tf/tensorflow_quick_start.py
            :language: python
            :start-after: __tf_trainer_begin__
            :end-before: __tf_trainer_end__
            :dedent:
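
        A sketch, assuming the distributed training function defined above:

        .. code-block:: python

            from ray.air.config import ScalingConfig
            from ray.train.tensorflow import TensorflowTrainer

            trainer = TensorflowTrainer(
                train_loop_per_worker=train_func_distributed,
                scaling_config=ScalingConfig(num_workers=4),
            )
            trainer.fit()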

        See :ref:`train-porting-code` for a more comprehensive example.

Next Steps
----------

* To check how your application is doing, you can use the :ref:`Ray dashboard <observability-getting-started>`.