.. _train-getting-started:

Getting Started with Distributed Model Training in Ray Train
=============================================================

Ray Train offers multiple ``Trainers`` which implement scalable model training for different machine learning frameworks.
Here are examples for some of the commonly used trainers:

.. tab-set::

    .. tab-item:: XGBoost

        In this example we will train a model using distributed XGBoost.

        First, we load the dataset from S3 using Ray Data and split it into a
        train and validation dataset.

        .. literalinclude:: doc_code/gbdt_user_guide.py
            :language: python
            :start-after: __xgb_detail_intro_start__
            :end-before: __xgb_detail_intro_end__
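        If you just want a rough feel for this step, a minimal sketch could look like
        the following. The S3 path and the split fraction are illustrative assumptions,
        not necessarily the values used in the included snippet.

        .. code-block:: python

            import ray

            # Load a CSV dataset from S3 into a Ray Dataset
            # (the path below is an illustrative placeholder).
            dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")

            # Split the data into a train and a validation dataset.
            train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)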
        In the :class:`ScalingConfig <ray.air.config.ScalingConfig>`,
        we configure the number of workers to use:

        .. literalinclude:: doc_code/gbdt_user_guide.py
            :language: python
            :start-after: __xgb_detail_scaling_start__
            :end-before: __xgb_detail_scaling_end__
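        For example, a ``ScalingConfig`` requesting two CPU training workers could look
        like this (the worker count and GPU setting are illustrative):

        .. code-block:: python

            from ray.air.config import ScalingConfig

            # Use two distributed training workers; set use_gpu=True to
            # schedule one GPU per worker instead.
            scaling_config = ScalingConfig(num_workers=2, use_gpu=False)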
        We then instantiate our ``XGBoostTrainer`` by passing in:

        - The aforementioned ``ScalingConfig``.
        - The ``label_column``, which refers to the column name containing the labels in the Dataset.
        - The ``params``, which are `XGBoost training parameters <https://xgboost.readthedocs.io/en/stable/parameter.html>`__.

        .. literalinclude:: doc_code/gbdt_user_guide.py
            :language: python
            :start-after: __xgb_detail_training_start__
            :end-before: __xgb_detail_training_end__
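        Putting these pieces together, a sketch of the trainer construction might look
        like the following. The ``num_boost_round`` value, the column name, and the
        ``params`` shown here are illustrative assumptions; refer to the included
        snippet for the exact configuration.

        .. code-block:: python

            from ray.train.xgboost import XGBoostTrainer

            trainer = XGBoostTrainer(
                scaling_config=scaling_config,
                label_column="target",  # column in the Dataset that holds the labels
                num_boost_round=20,     # illustrative number of boosting rounds
                params={
                    # XGBoost-specific training parameters.
                    "objective": "binary:logistic",
                    "eval_metric": ["logloss", "error"],
                },
                datasets={"train": train_dataset, "valid": valid_dataset},
            )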
        Lastly, we call ``trainer.fit()`` to kick off training and obtain the results.

        .. literalinclude:: doc_code/gbdt_user_guide.py
            :language: python
            :start-after: __xgb_detail_fit_start__
            :end-before: __xgb_detail_fit_end__
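        Calling ``fit()`` blocks until training finishes and returns a ``Result`` object
        carrying the reported metrics, roughly as sketched below:

        .. code-block:: python

            # Launch distributed training and wait for it to complete.
            result = trainer.fit()

            # The returned Result object carries the final reported metrics,
            # for example the evaluation error on the validation dataset.
            print(result.metrics)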
    .. tab-item:: LightGBM

        In this example we will train a model using distributed LightGBM.

        First, we load the dataset from S3 using Ray Data and split it into a
        train and validation dataset.

        .. literalinclude:: doc_code/gbdt_user_guide.py
            :language: python
            :start-after: __lgbm_detail_intro_start__
            :end-before: __lgbm_detail_intro_end__
        In the :class:`ScalingConfig <ray.air.config.ScalingConfig>`,
        we configure the number of workers to use:

        .. literalinclude:: doc_code/gbdt_user_guide.py
            :language: python
            :start-after: __lgbm_detail_scaling_start__
            :end-before: __lgbm_detail_scaling_end__
        We then instantiate our ``LightGBMTrainer`` by passing in:

        - The aforementioned ``ScalingConfig``.
        - The ``label_column``, which refers to the column name containing the labels in the Dataset.
        - The ``params``, which are core `LightGBM training parameters <https://lightgbm.readthedocs.io/en/latest/Parameters.html>`__.

        .. literalinclude:: doc_code/gbdt_user_guide.py
            :language: python
            :start-after: __lgbm_detail_training_start__
            :end-before: __lgbm_detail_training_end__
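        Apart from the trainer class and the LightGBM-specific ``params``, this mirrors
        the XGBoost example above. The parameter values in the sketch below are
        illustrative assumptions:

        .. code-block:: python

            from ray.train.lightgbm import LightGBMTrainer

            trainer = LightGBMTrainer(
                scaling_config=scaling_config,
                label_column="target",  # column in the Dataset that holds the labels
                params={
                    # Core LightGBM training parameters.
                    "objective": "binary",
                    "metric": ["binary_logloss", "binary_error"],
                },
                datasets={"train": train_dataset, "valid": valid_dataset},
            )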
        Lastly, we call ``trainer.fit()`` to kick off training and obtain the results.

        .. literalinclude:: doc_code/gbdt_user_guide.py
            :language: python
            :start-after: __lgbm_detail_fit_start__
            :end-before: __lgbm_detail_fit_end__
    .. tab-item:: PyTorch

        This example shows how you can use Ray Train with PyTorch.

        First, set up your dataset and model.

        .. literalinclude:: /../../python/ray/train/examples/pytorch/torch_quick_start.py
            :language: python
            :start-after: __torch_setup_begin__
            :end-before: __torch_setup_end__
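        As a rough, self-contained stand-in for the included snippet, the setup could be
        a small model and a synthetic dataset such as the following. The architecture,
        sizes, and random data are illustrative assumptions.

        .. code-block:: python

            import torch
            import torch.nn as nn

            num_samples = 20
            input_size = 10
            layer_size = 15
            output_size = 5

            class NeuralNetwork(nn.Module):
                def __init__(self):
                    super().__init__()
                    self.layer1 = nn.Linear(input_size, layer_size)
                    self.relu = nn.ReLU()
                    self.layer2 = nn.Linear(layer_size, output_size)

                def forward(self, x):
                    return self.layer2(self.relu(self.layer1(x)))

            # Synthetic inputs and labels for demonstration purposes.
            inputs = torch.randn(num_samples, input_size)
            labels = torch.randn(num_samples, output_size)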
        Now define your single-worker PyTorch training function.

        .. literalinclude:: /../../python/ray/train/examples/pytorch/torch_quick_start.py
            :language: python
            :start-after: __torch_single_begin__
            :end-before: __torch_single_end__
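        A minimal single-worker training function, assuming the model and synthetic data
        sketched above, might look like this:

        .. code-block:: python

            def train_func():
                num_epochs = 3
                model = NeuralNetwork()
                loss_fn = nn.MSELoss()
                optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

                for epoch in range(num_epochs):
                    output = model(inputs)
                    loss = loss_fn(output, labels)
                    optimizer.zero_grad()
                    loss.backward()
                    optimizer.step()
                    print(f"epoch: {epoch}, loss: {loss.item()}")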
        This training function can be executed with:

        .. literalinclude:: /../../python/ray/train/examples/pytorch/torch_quick_start.py
            :language: python
            :start-after: __torch_single_run_begin__
            :end-before: __torch_single_run_end__
            :dedent:
        Now let's convert this to a distributed multi-worker training function!

        All you have to do is use the ``ray.train.torch.prepare_model`` and
        ``ray.train.torch.prepare_data_loader`` utility functions to
        easily set up your model and data for distributed training.
        This will automatically wrap your model with ``DistributedDataParallel``
        and place it on the right device, and add ``DistributedSampler`` to your DataLoaders.

        .. literalinclude:: /../../python/ray/train/examples/pytorch/torch_quick_start.py
            :language: python
            :start-after: __torch_distributed_begin__
            :end-before: __torch_distributed_end__
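        A sketch of the distributed version, assuming the same model and synthetic data
        as above, could look like the following. Note how the only substantive changes
        are wrapping the data in a ``DataLoader`` and calling the two ``prepare_*``
        utilities; the batch size and epoch count are illustrative.

        .. code-block:: python

            from torch.utils.data import DataLoader, TensorDataset
            import ray.train.torch

            def train_func_distributed():
                num_epochs = 3
                batch_size = 5

                dataset = TensorDataset(inputs, labels)
                dataloader = DataLoader(dataset, batch_size=batch_size)
                # Adds a DistributedSampler and moves batches to the right device.
                dataloader = ray.train.torch.prepare_data_loader(dataloader)

                model = NeuralNetwork()
                # Wraps the model in DistributedDataParallel on the right device.
                model = ray.train.torch.prepare_model(model)

                loss_fn = nn.MSELoss()
                optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

                for epoch in range(num_epochs):
                    for batch_inputs, batch_labels in dataloader:
                        output = model(batch_inputs)
                        loss = loss_fn(output, batch_labels)
                        optimizer.zero_grad()
                        loss.backward()
                        optimizer.step()
                    print(f"epoch: {epoch}, loss: {loss.item()}")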
        Then, instantiate a ``TorchTrainer``
        with 4 workers, and use it to run the new training function!

        .. literalinclude:: /../../python/ray/train/examples/pytorch/torch_quick_start.py
            :language: python
            :start-after: __torch_trainer_begin__
            :end-before: __torch_trainer_end__
            :dedent:
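        For reference, a trainer launching the distributed function on 4 workers could be
        constructed roughly like this (set ``use_gpu=True`` to train on GPUs instead):

        .. code-block:: python

            from ray.train.torch import TorchTrainer
            from ray.air.config import ScalingConfig

            trainer = TorchTrainer(
                train_func_distributed,
                scaling_config=ScalingConfig(num_workers=4, use_gpu=False),
            )

            results = trainer.fit()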
        See :ref:`train-porting-code` for a more comprehensive example.

    .. tab-item:: TensorFlow

        This example shows how you can use Ray Train to set up `Multi-worker training
        with Keras <https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras>`_.

        First, set up your dataset and model.

        .. literalinclude:: /../../python/ray/train/examples/tf/tensorflow_quick_start.py
            :language: python
            :start-after: __tf_setup_begin__
            :end-before: __tf_setup_end__
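        As an illustrative stand-in for the included snippet, the setup could be a
        synthetic dataset builder and a small Keras model like the following. Both are
        assumptions, not the exact code in the example file.

        .. code-block:: python

            import numpy as np
            import tensorflow as tf

            def get_dataset(batch_size):
                # Synthetic regression data for demonstration purposes.
                x = np.random.random((1000, 10)).astype("float32")
                y = np.random.random((1000, 1)).astype("float32")
                return tf.data.Dataset.from_tensor_slices((x, y)).batch(batch_size)

            def build_and_compile_model():
                model = tf.keras.Sequential(
                    [
                        tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
                        tf.keras.layers.Dense(1),
                    ]
                )
                model.compile(optimizer="sgd", loss="mse")
                return model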
        Now define your single-worker TensorFlow training function.

        .. literalinclude:: /../../python/ray/train/examples/tf/tensorflow_quick_start.py
            :language: python
            :start-after: __tf_single_begin__
            :end-before: __tf_single_end__
        This training function can be executed with:

        .. literalinclude:: /../../python/ray/train/examples/tf/tensorflow_quick_start.py
            :language: python
            :start-after: __tf_single_run_begin__
            :end-before: __tf_single_run_end__
            :dedent:
        Now let's convert this to a distributed multi-worker training function!

        All you need to do is:

        1. Set the per-worker batch size. Each worker processes a batch of the same size
           as in the single-worker code.
        2. Choose your TensorFlow distributed training strategy. This example uses the
           ``MultiWorkerMirroredStrategy``.

        .. literalinclude:: /../../python/ray/train/examples/tf/tensorflow_quick_start.py
            :language: python
            :start-after: __tf_distributed_begin__
            :end-before: __tf_distributed_end__
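        A sketch of such a distributed training function, assuming the helper functions
        from the setup sketch above, could look like this. The per-worker batch size and
        epoch count are illustrative assumptions.

        .. code-block:: python

            from ray.air import session

            def train_func_distributed():
                per_worker_batch_size = 64
                # Scale the global batch size by the number of workers so that each
                # worker still processes per_worker_batch_size examples per step.
                global_batch_size = per_worker_batch_size * session.get_world_size()

                # MultiWorkerMirroredStrategy synchronizes gradients across workers.
                strategy = tf.distribute.MultiWorkerMirroredStrategy()
                with strategy.scope():
                    # Model building and compiling must happen within strategy.scope().
                    multi_worker_model = build_and_compile_model()

                multi_worker_dataset = get_dataset(global_batch_size)
                multi_worker_model.fit(multi_worker_dataset, epochs=3)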
        Then, instantiate a ``TensorflowTrainer`` with 4 workers,
        and use it to run the new training function!

        .. literalinclude:: /../../python/ray/train/examples/tf/tensorflow_quick_start.py
            :language: python
            :start-after: __tf_trainer_begin__
            :end-before: __tf_trainer_end__
            :dedent:
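        For reference, the trainer construction could be sketched like this (the worker
        count and ``use_gpu`` setting are the same illustrative assumptions as above):

        .. code-block:: python

            from ray.train.tensorflow import TensorflowTrainer
            from ray.air.config import ScalingConfig

            trainer = TensorflowTrainer(
                train_func_distributed,
                scaling_config=ScalingConfig(num_workers=4, use_gpu=False),
            )

            results = trainer.fit()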
        See :ref:`train-porting-code` for a more comprehensive example.

Next Steps
----------

* To check how your application is doing, you can use the :ref:`Ray dashboard <observability-getting-started>`.