.. _train-arch:

.. TODO: the diagram and some of the components (in the given context) are outdated.
   Make sure to fix this.

Ray Train Architecture
======================

The process of training models with Ray Train consists of several components.
First, depending on the training framework you want to work with, you
provide a ``Trainer`` that manages the training process.
For instance, to train a PyTorch model, you use a ``TorchTrainer``.
The actual training load is distributed among workers on a cluster that belong
to a ``WorkerGroup``.
Each framework has its own communication protocols and exchange formats,
which is why Ray Train provides ``Backend`` implementations (e.g. ``TorchBackend``)
that can be used to run the training process using a ``BackendExecutor``.

Here's a visual overview of the architecture components of Ray Train:
.. image:: train-arch.svg
    :width: 70%
    :align: center

Below we discuss each component in a bit more detail.
Trainer
-------

Trainers are your main entry point to the Ray Train API.
Ray Train provides a :ref:`BaseTrainer<train-base-trainer>`, and
many framework-specific Trainers inherit from the derived ``DataParallelTrainer``
(e.g. for TensorFlow or Torch) and ``GBDTTrainer`` (e.g. for XGBoost or LightGBM).
Defining an actual Trainer, such as a ``TorchTrainer``, works as follows:

* You pass in a *function* to the Trainer which defines the training logic.
* The Trainer creates an :ref:`Executor <train-arch-executor>` to run the distributed training.
* The Trainer handles callbacks based on the results from the executor.
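
The steps above can be sketched as follows (a minimal, illustrative example
assuming Ray 2.x with ``ray[train]`` and ``torch`` installed; the toy model,
data, and config values are placeholders, not prescribed by Ray Train):

.. code-block:: python

    import torch
    import torch.nn as nn

    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer


    def train_loop_per_worker(config: dict) -> None:
        # The training logic that each worker in the WorkerGroup runs.
        model = nn.Linear(4, 1)
        optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
        for _ in range(config["epochs"]):
            inputs, targets = torch.randn(8, 4), torch.randn(8, 1)
            loss = nn.functional.mse_loss(model(inputs), targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()


    # The Trainer takes the function and distributes it across workers.
    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"lr": 0.01, "epochs": 2},
        scaling_config=ScalingConfig(num_workers=2),
    )
    # result = trainer.fit()  # launches the distributed training run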
.. _train-arch-backend:

Backend
-------

Backends are used to initialize and manage framework-specific communication protocols.
Each training library (Torch, Horovod, TensorFlow, etc.) has a separate backend
and takes specific configuration values defined in a :ref:`BackendConfig<train-backend-config>`.
Each backend comes with a ``BackendExecutor`` that is used to run the training process.
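
For example, the Torch backend reads its settings from a ``TorchConfig``
(a minimal sketch assuming Ray 2.x; the ``"gloo"`` choice and empty training
function are illustrative):

.. code-block:: python

    from ray.train import ScalingConfig
    from ray.train.torch import TorchConfig, TorchTrainer


    def train_loop_per_worker():
        pass  # per-worker training logic goes here


    # TorchConfig selects the Torch process-group backend, e.g. "gloo"
    # for CPU-only clusters or "nccl" for GPU communication.
    trainer = TorchTrainer(
        train_loop_per_worker,
        torch_config=TorchConfig(backend="gloo"),
        scaling_config=ScalingConfig(num_workers=2),
    )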
.. _train-arch-executor:

Executor
--------

The executor is an interface (``BackendExecutor``) that executes distributed training.
It handles the creation of a group of workers (using :ref:`Ray Actors<actor-guide>`)
and is initialized with a :ref:`backend<train-arch-backend>`.
The executor passes all required resources, the number of workers, and information about
worker placement to the ``WorkerGroup``.
WorkerGroup
-----------

The WorkerGroup is a generic utility class for managing a group of Ray Actors.
This is similar in concept to Fiber's `Ring <https://uber.github.io/fiber/experimental/ring/>`_.
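
The underlying pattern, a handle on a group of Ray Actors that you can
broadcast work to, can be sketched with plain Ray (``Worker`` here is an
illustrative actor class, not Ray Train's internal implementation):

.. code-block:: python

    import ray

    ray.init(ignore_reinit_error=True)


    @ray.remote
    class Worker:
        """An illustrative actor standing in for one training worker."""

        def __init__(self, rank: int):
            self.rank = rank

        def run(self, fn):
            # Execute a task function on this worker, passing its rank.
            return fn(self.rank)


    # Create the group and broadcast a task to every actor in it.
    workers = [Worker.remote(rank) for rank in range(2)]
    results = ray.get([w.run.remote(lambda rank: rank * 10) for w in workers])
    print(results)  # prints [0, 10]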