.. _train-getting-started:

Getting Started with Distributed Model Training in Ray Train
============================================================

Ray Train offers multiple ``Trainers`` that implement scalable model training for different machine learning frameworks.
Here are examples for some of the commonly used trainers:

.. tab-set::

    .. tab-item:: XGBoost

        In this example we will train a model using distributed XGBoost.

        First, we load the dataset from S3 using Ray Data and split it into a
        train and validation dataset.

        .. literalinclude:: doc_code/gbdt_user_guide.py
            :language: python
            :start-after: __xgb_detail_intro_start__
            :end-before: __xgb_detail_intro_end__
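
        For reference, here is a minimal sketch of what this step can look
        like. The dataset path and the split ratio are illustrative; the exact
        code is in the included snippet above.

        .. code-block:: python

            import ray

            # Load a CSV dataset from S3 into a Ray Dataset.
            # The path below is illustrative.
            dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")

            # Split it into train and validation datasets.
            train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)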

        In the :class:`ScalingConfig <ray.air.config.ScalingConfig>`,
        we configure the number of workers to use:

        .. literalinclude:: doc_code/gbdt_user_guide.py
            :language: python
            :start-after: __xgb_detail_scaling_start__
            :end-before: __xgb_detail_scaling_end__
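
        A minimal sketch of this configuration, assuming two CPU workers:

        .. code-block:: python

            from ray.air.config import ScalingConfig

            scaling_config = ScalingConfig(
                # Number of distributed workers.
                num_workers=2,
                # Set to True to use GPU acceleration.
                use_gpu=False,
            )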

        We then instantiate our ``XGBoostTrainer`` by passing in:

        - The aforementioned ``ScalingConfig``.
        - The ``label_column``, which is the name of the column in the Dataset that contains the labels.
        - The ``params``, which are `XGBoost training parameters <https://xgboost.readthedocs.io/en/stable/parameter.html>`__.

        .. literalinclude:: doc_code/gbdt_user_guide.py
            :language: python
            :start-after: __xgb_detail_training_start__
            :end-before: __xgb_detail_training_end__
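
        A sketch of the instantiation, assuming the objects from the previous
        sketches; the label column name and the parameter values are
        illustrative:

        .. code-block:: python

            from ray.train.xgboost import XGBoostTrainer

            trainer = XGBoostTrainer(
                scaling_config=scaling_config,
                label_column="target",  # illustrative column name
                params={
                    # XGBoost-specific training parameters.
                    "objective": "binary:logistic",
                    "eval_metric": ["logloss", "error"],
                },
                datasets={"train": train_dataset, "valid": valid_dataset},
            )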

        Lastly, we call ``trainer.fit()`` to kick off training and obtain the results.

        .. literalinclude:: doc_code/gbdt_user_guide.py
            :language: python
            :start-after: __xgb_detail_fit_start__
            :end-before: __xgb_detail_fit_end__
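
        For reference, a sketch of this final step:

        .. code-block:: python

            # Run distributed training and block until it completes.
            result = trainer.fit()

            # Reported metrics, such as the final evaluation scores.
            print(result.metrics)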

    .. tab-item:: LightGBM

        In this example we will train a model using distributed LightGBM.

        First, we load the dataset from S3 using Ray Data and split it into a
        train and validation dataset.

        .. literalinclude:: doc_code/gbdt_user_guide.py
            :language: python
            :start-after: __lgbm_detail_intro_start__
            :end-before: __lgbm_detail_intro_end__

        In the :class:`ScalingConfig <ray.air.config.ScalingConfig>`,
        we configure the number of workers to use:

        .. literalinclude:: doc_code/gbdt_user_guide.py
            :language: python
            :start-after: __lgbm_detail_scaling_start__
            :end-before: __lgbm_detail_scaling_end__

        We then instantiate our ``LightGBMTrainer`` by passing in:

        - The aforementioned ``ScalingConfig``.
        - The ``label_column``, which is the name of the column in the Dataset that contains the labels.
        - The ``params``, which are core `LightGBM training parameters <https://lightgbm.readthedocs.io/en/latest/Parameters.html>`__.

        .. literalinclude:: doc_code/gbdt_user_guide.py
            :language: python
            :start-after: __lgbm_detail_training_start__
            :end-before: __lgbm_detail_training_end__
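
        The instantiation mirrors the XGBoost example; a sketch with
        illustrative column name and parameters, assuming the objects from the
        previous steps:

        .. code-block:: python

            from ray.train.lightgbm import LightGBMTrainer

            trainer = LightGBMTrainer(
                scaling_config=scaling_config,
                label_column="target",  # illustrative column name
                params={
                    # Core LightGBM training parameters.
                    "objective": "binary",
                    "metric": ["binary_logloss", "binary_error"],
                },
                datasets={"train": train_dataset, "valid": valid_dataset},
            )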

        Lastly, we call ``trainer.fit()`` to kick off training and obtain the results.

        .. literalinclude:: doc_code/gbdt_user_guide.py
            :language: python
            :start-after: __lgbm_detail_fit_start__
            :end-before: __lgbm_detail_fit_end__

    .. tab-item:: PyTorch

        This example shows how you can use Ray Train with PyTorch.

        First, set up your dataset and model.

        .. literalinclude:: /../../python/ray/train/examples/pytorch/torch_quick_start.py
            :language: python
            :start-after: __torch_setup_begin__
            :end-before: __torch_setup_end__
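
        A minimal sketch of such a setup, using a small randomly generated
        regression dataset and a two-layer network; the sizes and names are
        illustrative:

        .. code-block:: python

            import torch
            import torch.nn as nn

            num_samples = 20
            input_size = 10
            layer_size = 15
            output_size = 5

            class NeuralNetwork(nn.Module):
                def __init__(self):
                    super().__init__()
                    self.layer1 = nn.Linear(input_size, layer_size)
                    self.relu = nn.ReLU()
                    self.layer2 = nn.Linear(layer_size, output_size)

                def forward(self, x):
                    return self.layer2(self.relu(self.layer1(x)))

            # Random inputs and labels stand in for a real dataset.
            inputs = torch.randn(num_samples, input_size)
            labels = torch.randn(num_samples, output_size)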

        Now define your single-worker PyTorch training function.

        .. literalinclude:: /../../python/ray/train/examples/pytorch/torch_quick_start.py
            :language: python
            :start-after: __torch_single_begin__
            :end-before: __torch_single_end__
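
        For instance, a plain training loop over the model and data sketched
        in the setup step:

        .. code-block:: python

            def train_func():
                num_epochs = 3
                model = NeuralNetwork()
                loss_fn = nn.MSELoss()
                optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

                for epoch in range(num_epochs):
                    output = model(inputs)
                    loss = loss_fn(output, labels)
                    optimizer.zero_grad()
                    loss.backward()
                    optimizer.step()
                    print(f"epoch: {epoch}, loss: {loss.item()}")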

        This training function can be executed with:

        .. literalinclude:: /../../python/ray/train/examples/pytorch/torch_quick_start.py
            :language: python
            :start-after: __torch_single_run_begin__
            :end-before: __torch_single_run_end__
            :dedent:

        Now let's convert this to a distributed multi-worker training function!

        All you have to do is use the ``ray.train.torch.prepare_model`` and
        ``ray.train.torch.prepare_data_loader`` utility functions to
        set up your model and data for distributed training.
        This automatically wraps your model with ``DistributedDataParallel``,
        places it on the right device, and adds ``DistributedSampler`` to your DataLoaders.

        .. literalinclude:: /../../python/ray/train/examples/pytorch/torch_quick_start.py
            :language: python
            :start-after: __torch_distributed_begin__
            :end-before: __torch_distributed_end__
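
        A sketch of the distributed version, assuming the model and data from
        the setup step; the batch size is illustrative:

        .. code-block:: python

            from torch.utils.data import DataLoader, TensorDataset

            import ray.train.torch

            def train_func_distributed():
                num_epochs = 3
                model = NeuralNetwork()
                # Wraps the model with DistributedDataParallel and moves it
                # to the right device.
                model = ray.train.torch.prepare_model(model)
                loss_fn = nn.MSELoss()
                optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

                dataloader = DataLoader(TensorDataset(inputs, labels), batch_size=4)
                # Adds a DistributedSampler and moves batches to the right device.
                dataloader = ray.train.torch.prepare_data_loader(dataloader)

                for epoch in range(num_epochs):
                    for batch_inputs, batch_labels in dataloader:
                        output = model(batch_inputs)
                        loss = loss_fn(output, batch_labels)
                        optimizer.zero_grad()
                        loss.backward()
                        optimizer.step()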

        Then, instantiate a ``TorchTrainer``
        with 4 workers, and use it to run the new training function!

        .. literalinclude:: /../../python/ray/train/examples/pytorch/torch_quick_start.py
            :language: python
            :start-after: __torch_trainer_begin__
            :end-before: __torch_trainer_end__
            :dedent:
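
        A sketch of this step, assuming the distributed training function
        defined above:

        .. code-block:: python

            from ray.air.config import ScalingConfig
            from ray.train.torch import TorchTrainer

            trainer = TorchTrainer(
                train_loop_per_worker=train_func_distributed,
                scaling_config=ScalingConfig(num_workers=4, use_gpu=False),
            )
            results = trainer.fit()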

        See :ref:`train-porting-code` for a more comprehensive example.

    .. tab-item:: TensorFlow

        This example shows how you can use Ray Train to set up `Multi-worker training
        with Keras <https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras>`_.

        First, set up your dataset and model.

        .. literalinclude:: /../../python/ray/train/examples/tf/tensorflow_quick_start.py
            :language: python
            :start-after: __tf_setup_begin__
            :end-before: __tf_setup_end__
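
        A minimal sketch of such a setup, using the Keras MNIST dataset and a
        small CNN; the architecture and hyperparameters are illustrative:

        .. code-block:: python

            import numpy as np
            import tensorflow as tf

            def mnist_dataset(batch_size):
                (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
                # Normalize pixel values and cast labels for training.
                x_train = x_train / np.float32(255)
                y_train = y_train.astype(np.int64)
                return (
                    tf.data.Dataset.from_tensor_slices((x_train, y_train))
                    .shuffle(60000)
                    .repeat()
                    .batch(batch_size)
                )

            def build_and_compile_cnn_model():
                model = tf.keras.Sequential([
                    tf.keras.layers.InputLayer(input_shape=(28, 28)),
                    tf.keras.layers.Reshape(target_shape=(28, 28, 1)),
                    tf.keras.layers.Conv2D(32, 3, activation="relu"),
                    tf.keras.layers.Flatten(),
                    tf.keras.layers.Dense(128, activation="relu"),
                    tf.keras.layers.Dense(10),
                ])
                model.compile(
                    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                    optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
                    metrics=["accuracy"],
                )
                return model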

        Now define your single-worker TensorFlow training function.

        .. literalinclude:: /../../python/ray/train/examples/tf/tensorflow_quick_start.py
            :language: python
            :start-after: __tf_single_begin__
            :end-before: __tf_single_end__
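
        For instance, a sketch assuming the helpers from the setup step; the
        batch size and epoch counts are illustrative:

        .. code-block:: python

            def train_func():
                batch_size = 64
                single_worker_dataset = mnist_dataset(batch_size)
                single_worker_model = build_and_compile_cnn_model()
                single_worker_model.fit(
                    single_worker_dataset, epochs=3, steps_per_epoch=70
                )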

        This training function can be executed with:

        .. literalinclude:: /../../python/ray/train/examples/tf/tensorflow_quick_start.py
            :language: python
            :start-after: __tf_single_run_begin__
            :end-before: __tf_single_run_end__
            :dedent:

        Now let's convert this to a distributed multi-worker training function!

        All you need to do is:

        1. Set the per-worker batch size: each worker processes the same size
           batch as in the single-worker code.
        2. Choose your TensorFlow distributed training strategy. In this example,
           we use the ``MultiWorkerMirroredStrategy``.

        .. literalinclude:: /../../python/ray/train/examples/tf/tensorflow_quick_start.py
            :language: python
            :start-after: __tf_distributed_begin__
            :end-before: __tf_distributed_end__
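
        A sketch of the distributed version, assuming the helpers from the
        setup step. Ray Train sets the ``TF_CONFIG`` environment variable on
        each worker, which we read here to scale the global batch size:

        .. code-block:: python

            import json
            import os

            def train_func_distributed():
                per_worker_batch_size = 64
                # Ray Train sets TF_CONFIG on each worker; use it to find
                # the total number of workers.
                tf_config = json.loads(os.environ["TF_CONFIG"])
                num_workers = len(tf_config["cluster"]["worker"])

                strategy = tf.distribute.MultiWorkerMirroredStrategy()

                global_batch_size = per_worker_batch_size * num_workers
                multi_worker_dataset = mnist_dataset(global_batch_size)

                # Model building and compiling must happen within strategy.scope().
                with strategy.scope():
                    multi_worker_model = build_and_compile_cnn_model()

                multi_worker_model.fit(
                    multi_worker_dataset, epochs=3, steps_per_epoch=70
                )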

        Then, instantiate a ``TensorflowTrainer`` with 4 workers,
        and use it to run the new training function!

        .. literalinclude:: /../../python/ray/train/examples/tf/tensorflow_quick_start.py
            :language: python
            :start-after: __tf_trainer_begin__
            :end-before: __tf_trainer_end__
            :dedent:
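
        A sketch, assuming the distributed training function defined above:

        .. code-block:: python

            from ray.air.config import ScalingConfig
            from ray.train.tensorflow import TensorflowTrainer

            trainer = TensorflowTrainer(
                train_loop_per_worker=train_func_distributed,
                scaling_config=ScalingConfig(num_workers=4),
            )
            trainer.fit()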

        See :ref:`train-porting-code` for a more comprehensive example.

Next Steps
----------

* To check how your application is doing, you can use the :ref:`Ray dashboard <observability-getting-started>`.