
.. _train-gbdt-guide:

XGBoost & LightGBM User Guide for Ray Train
===========================================

Ray Train has built-in support for XGBoost and LightGBM.

Basic Training with Tree-Based Models in Train
----------------------------------------------

Just as in the original `xgboost.train() <https://xgboost.readthedocs.io/en/stable/parameter.html>`__ and
`lightgbm.train() <https://lightgbm.readthedocs.io/en/latest/Parameters.html>`__ functions, the
training parameters are passed as the ``params`` dictionary.

.. tab-set::

    .. tab-item:: XGBoost

        Run ``pip install -U xgboost_ray``.

        .. literalinclude:: doc_code/gbdt_user_guide.py
            :language: python
            :start-after: __xgboost_start__
            :end-before: __xgboost_end__

    .. tab-item:: LightGBM

        Run ``pip install -U lightgbm_ray``.

        .. literalinclude:: doc_code/gbdt_user_guide.py
            :language: python
            :start-after: __lightgbm_start__
            :end-before: __lightgbm_end__

Ray-specific params are passed in through the trainer constructors.
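
For illustration, here is a minimal sketch of how the two kinds of parameters are split.
The dataset, label column, and parameter values below are placeholder assumptions, not
values taken from the example code above:

.. code-block:: python

    import ray
    from ray.air.config import ScalingConfig
    from ray.train.xgboost import XGBoostTrainer

    # Hypothetical in-memory dataset with a "target" label column.
    train_dataset = ray.data.from_items(
        [{"x": float(i), "target": i % 2} for i in range(100)]
    )

    trainer = XGBoostTrainer(
        # Ray-specific arguments go to the trainer constructor.
        scaling_config=ScalingConfig(num_workers=2),
        datasets={"train": train_dataset},
        label_column="target",
        num_boost_round=10,
        # XGBoost-native training parameters go into ``params``,
        # just like in ``xgboost.train()``.
        params={"objective": "binary:logistic", "eta": 0.3},
    )
    result = trainer.fit()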
Saving and Loading XGBoost and LightGBM Checkpoints
---------------------------------------------------

Since a new tree is trained on every boosting round,
it's possible to save a checkpoint to snapshot the training progress so far.
:class:`~ray.train.xgboost.XGBoostTrainer` and :class:`~ray.train.lightgbm.LightGBMTrainer`
both implement checkpointing out of the box.

The only required change is to configure :class:`~ray.air.CheckpointConfig` to set
the checkpointing frequency. For example, the following configuration
saves a checkpoint on every boosting round and only keeps the latest checkpoint:

.. literalinclude:: doc_code/key_concepts.py
    :language: python
    :start-after: __checkpoint_config_ckpt_freq_start__
    :end-before: __checkpoint_config_ckpt_freq_end__

.. tip::

    Once checkpointing is enabled, you can follow :ref:`this guide <train-fault-tolerance>`
    to enable fault tolerance.

    See the :ref:`Trainer restore API reference <trainer-restore>` for more details.
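
To show both halves of this section in one place, here is a minimal sketch that configures
per-round checkpointing and then loads the trained booster back from the final checkpoint.
The dataset and parameters are placeholder assumptions, and ``XGBoostTrainer.get_model`` is
used on the assumption that it is available in your Ray version (check the API reference):

.. code-block:: python

    import ray
    from ray.air import CheckpointConfig, RunConfig
    from ray.air.config import ScalingConfig
    from ray.train.xgboost import XGBoostTrainer

    train_dataset = ray.data.from_items(
        [{"x": float(i), "target": i % 2} for i in range(100)]
    )

    trainer = XGBoostTrainer(
        scaling_config=ScalingConfig(num_workers=2),
        # Save a checkpoint every boosting round, keep only the latest one.
        run_config=RunConfig(
            checkpoint_config=CheckpointConfig(num_to_keep=1, checkpoint_frequency=1)
        ),
        datasets={"train": train_dataset},
        label_column="target",
        num_boost_round=10,
        params={"objective": "binary:logistic"},
    )
    result = trainer.fit()

    # Recover the trained booster from the last checkpoint.
    booster = XGBoostTrainer.get_model(result.checkpoint)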
How to scale out training?
--------------------------

The benefit of using Ray AIR is that you can seamlessly scale up your training by
adjusting the :class:`ScalingConfig <ray.air.config.ScalingConfig>`.

.. note::

    Ray Train does not modify or otherwise alter the working
    of the underlying XGBoost / LightGBM distributed training algorithms.
    Ray only provides orchestration, data ingest and fault tolerance.
    For more information on GBDT distributed training, refer to the
    `XGBoost documentation <https://xgboost.readthedocs.io>`__ and the
    `LightGBM documentation <https://lightgbm.readthedocs.io/>`__.

Here are some examples for common use-cases:

.. tab-set::

    .. tab-item:: Multi-node CPU

        Setup: 4 nodes with 8 CPUs each.

        Use-case: To utilize all resources in multi-node training.

        .. literalinclude:: doc_code/gbdt_user_guide.py
            :language: python
            :start-after: __scaling_cpu_start__
            :end-before: __scaling_cpu_end__

        Note that we pass 0 CPUs for the trainer resources, so that all resources can
        be allocated to the actual distributed training workers (see the sketch after
        these tabs).

    .. tab-item:: Single-node multi-GPU

        Setup: 1 node with 8 CPUs and 4 GPUs.

        Use-case: If you have a single node with multiple GPUs, you need to use
        distributed training to leverage all GPUs.

        .. literalinclude:: doc_code/gbdt_user_guide.py
            :language: python
            :start-after: __scaling_gpu_start__
            :end-before: __scaling_gpu_end__

    .. tab-item:: Multi-node multi-GPU

        Setup: 4 nodes with 8 CPUs and 4 GPUs each.

        Use-case: If you have multiple nodes with multiple GPUs, you need to
        schedule one worker per GPU.

        .. literalinclude:: doc_code/gbdt_user_guide.py
            :language: python
            :start-after: __scaling_gpumulti_start__
            :end-before: __scaling_gpumulti_end__

        Note that you just have to adjust the number of workers. Everything else
        is handled by Ray automatically.
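
As a standalone sketch of the multi-node CPU case above (the worker and CPU counts are
the assumed 4-node, 8-CPU setup, not values from the example code):

.. code-block:: python

    from ray.air.config import ScalingConfig

    scaling_config = ScalingConfig(
        # One worker per node, using all 8 CPUs of each node.
        num_workers=4,
        resources_per_worker={"CPU": 8},
        # Reserve no CPUs for the trainer itself, so every CPU
        # is available to the distributed training workers.
        trainer_resources={"CPU": 0},
    )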
How many remote actors should I use?
------------------------------------

This depends on your workload and your cluster setup.
Generally there is no inherent benefit to running more than
one remote actor per node for CPU-only training. This is because
XGBoost can already leverage multiple CPUs via threading.

However, there are some cases when you should consider starting
more than one actor per node:

* For **multi GPU training**, each GPU should have a separate
  remote actor. Thus, if your machine has 24 CPUs and 4 GPUs,
  you will want to start 4 remote actors with 6 CPUs and 1 GPU
  each.

* In a **heterogeneous cluster**, you might want to find the
  `greatest common divisor <https://en.wikipedia.org/wiki/Greatest_common_divisor>`_
  for the number of CPUs.
  For example, for a cluster with three nodes of 4, 8, and 12 CPUs, respectively,
  you should set the number of actors to 6 and the CPUs per
  actor to 4, as in the sketch below.
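
A minimal sketch of that heterogeneous-cluster arithmetic (the node sizes are the
assumed 4, 8, and 12 CPU example):

.. code-block:: python

    import math
    from functools import reduce

    node_cpus = [4, 8, 12]

    # CPUs per actor: the greatest common divisor of all node sizes.
    cpus_per_actor = reduce(math.gcd, node_cpus)  # -> 4

    # Number of actors: total CPUs divided by CPUs per actor.
    num_actors = sum(node_cpus) // cpus_per_actor  # -> 6

    print(num_actors, cpus_per_actor)  # 6 actors with 4 CPUs each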
How to use GPUs for training?
-----------------------------

Ray AIR enables multi GPU training for XGBoost and LightGBM. The core backends
will automatically leverage NCCL2 for cross-device communication.
All you have to do is start one actor per GPU and set GPU-compatible parameters,
e.g. XGBoost's ``tree_method`` to ``gpu_hist`` (see the XGBoost
documentation for more details).

For instance, if you have 2 machines with 4 GPUs each, you will want
to start 8 workers and set ``use_gpu=True``. There is usually
no benefit in allocating less (e.g. 0.5) or more than one GPU per actor.

You should divide the CPUs evenly across actors per machine, so if your
machines have 16 CPUs in addition to the 4 GPUs, each actor should have
4 CPUs to use.

.. literalinclude:: doc_code/gbdt_user_guide.py
    :language: python
    :start-after: __gpu_xgboost_start__
    :end-before: __gpu_xgboost_end__
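
The resource arithmetic from the paragraphs above, written out as a small sketch
(the machine counts and sizes are the assumed 2-machine, 4-GPU, 16-CPU example):

.. code-block:: python

    from ray.air.config import ScalingConfig

    num_machines = 2
    gpus_per_machine = 4
    cpus_per_machine = 16

    # One worker (actor) per GPU across the whole cluster.
    num_workers = num_machines * gpus_per_machine  # -> 8

    # Split each machine's CPUs evenly across its GPU workers.
    cpus_per_worker = cpus_per_machine // gpus_per_machine  # -> 4

    scaling_config = ScalingConfig(
        num_workers=num_workers,
        use_gpu=True,
        resources_per_worker={"CPU": cpus_per_worker},
    )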
How to optimize XGBoost memory usage?
-------------------------------------

XGBoost uses a compute-optimized data structure, the ``DMatrix``,
to hold training data. When converting a dataset to a ``DMatrix``,
XGBoost creates intermediate copies and ends up
holding a complete copy of the full data. The data is converted
into the local data format (on a 64-bit system these are 64-bit floats).
Depending on the system and original dataset dtype, this matrix can
thus occupy more memory than the original dataset.

The **peak memory usage** for CPU-based training is at least
**3x** the dataset size (assuming dtype ``float32`` on a 64-bit system)
plus about **400,000 KiB** for other resources,
like operating system requirements and storing of intermediate
results.

**Example**

* Machine type: AWS m5.xlarge (4 vCPUs, 16 GiB RAM)
* Usable RAM: ~15,350,000 KiB
* Dataset: 1,250,000 rows with 1024 features, dtype float32.
  Total size: 5,000,000 KiB
* XGBoost DMatrix size: ~10,000,000 KiB

This dataset will fit exactly on this node for training.

Note that the DMatrix size might be lower on a 32-bit system.
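
A back-of-the-envelope sketch of the example's numbers (the sizes follow the rule of
thumb above; they are estimates, not measurements):

.. code-block:: python

    rows = 1_250_000
    features = 1024
    bytes_per_value = 4  # float32

    # Raw dataset size in KiB.
    dataset_kib = rows * features * bytes_per_value / 1024  # -> 5,000,000 KiB

    # The DMatrix stores 64-bit floats on a 64-bit system, roughly doubling it.
    dmatrix_kib = dataset_kib * 2  # -> ~10,000,000 KiB

    # Rule of thumb: peak CPU training memory is at least 3x the dataset size
    # plus ~400,000 KiB of overhead -- roughly the usable RAM of the m5.xlarge.
    peak_kib = 3 * dataset_kib + 400_000  # -> ~15,400,000 KiB

    print(f"{dataset_kib:,.0f} {dmatrix_kib:,.0f} {peak_kib:,.0f}")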
**GPUs**

Generally, the same memory requirements exist for GPU-based
training. Additionally, the GPU must have enough memory
to hold the dataset.

In the example above, the GPU must have at least
10,000,000 KiB (about 9.5 GiB) of memory. However,
we empirically found that using a ``DeviceQuantileDMatrix``
seems to result in higher peak GPU memory usage, possibly
due to intermediate storage when loading data (about 10% more).

**Best practices**

In order to reduce peak memory usage, consider the following
suggestions:

* Store data as ``float32`` or less. More precision is often
  not needed, and keeping data in a smaller format will
  help reduce peak memory usage for initial data loading.

* Pass the ``dtype`` when loading data from CSV. Otherwise,
  floating point values will be loaded as ``np.float64``
  by default, increasing peak memory usage by 33%.
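
As a small sketch of the last suggestion (the file name is a placeholder, and the single
``dtype`` assumes all columns are numeric):

.. code-block:: python

    import numpy as np
    import pandas as pd

    # Without an explicit dtype, pandas loads floating point columns as
    # float64, doubling the memory needed for the raw data.
    df = pd.read_csv("train.csv", dtype=np.float32)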