- .. _train-arch:
- .. TODO: the diagram and some of the components (in the given context) are outdated.
- Make sure to fix this.
- Ray Train Architecture
- ======================
- Training a model with Ray Train involves several components.
- First, you provide a ``Trainer`` for the training framework you want to use;
- the ``Trainer`` manages the training process.
- For instance, to train a PyTorch model, you use a ``TorchTrainer``.
- The training workload is distributed among workers on a cluster; together these
- workers form a ``WorkerGroup``.
- Each framework has its own communication protocols and exchange formats,
- which is why Ray Train provides ``Backend`` implementations (e.g. ``TorchBackend``)
- that a ``BackendExecutor`` uses to run the training process.
- Here's a visual overview of the architecture components of Ray Train:
- .. image:: train-arch.svg
- :width: 70%
- :align: center
- Below we discuss each component in a bit more detail.
- Trainer
- -------
- Trainers are your main entry point to the Ray Train API.
- Train provides a :ref:`BaseTrainer<train-base-trainer>`.
- Many framework-specific Trainers inherit from one of two classes derived from it:
- ``DataParallelTrainer`` (for example, the TensorFlow and Torch Trainers) or
- ``GBDTTrainer`` (for example, the XGBoost and LightGBM Trainers).
- Defining an actual Trainer, such as ``TorchTrainer``, works as follows:
- * You pass the Trainer a *function* that defines the training logic.
- * The Trainer creates an :ref:`Executor <train-arch-executor>` to run the distributed training.
- * The Trainer handles callbacks based on the results from the executor.
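The three steps above can be sketched in plain Python. This is a toy illustration only, not Ray Train's implementation: ``ToyTrainer``, ``ToyExecutor``, their method names, and the ``(rank, world_size)`` signature of the training function are all hypothetical, and the "workers" run sequentially in one process rather than as remote actors.

```python
from typing import Callable, Dict, List, Optional


class ToyExecutor:
    """Hypothetical stand-in for an executor: runs one function call per worker."""

    def __init__(self, num_workers: int) -> None:
        self.num_workers = num_workers

    def run(self, train_func: Callable[[int, int], Dict]) -> List[Dict]:
        # A real executor would dispatch to remote workers; this loop simply
        # calls the function once per rank in the current process.
        return [train_func(rank, self.num_workers) for rank in range(self.num_workers)]


class ToyTrainer:
    """Hypothetical Trainer: wires a training function to an executor."""

    def __init__(
        self,
        train_func: Callable[[int, int], Dict],
        num_workers: int,
        callbacks: Optional[List[Callable[[List[Dict]], None]]] = None,
    ) -> None:
        self.train_func = train_func
        self.num_workers = num_workers
        self.callbacks = callbacks or []

    def fit(self) -> List[Dict]:
        # Step 2: the Trainer creates an executor to run the training function.
        executor = ToyExecutor(self.num_workers)
        results = executor.run(self.train_func)
        # Step 3: the Trainer invokes callbacks with the executor's results.
        for callback in self.callbacks:
            callback(results)
        return results


def train_func(rank: int, world_size: int) -> Dict:
    # Step 1: user-defined training logic. Each worker takes an interleaved
    # shard of eight dummy samples and reports how many it processed.
    shard = list(range(rank, 8, world_size))
    return {"rank": rank, "num_samples": len(shard)}


seen: List[List[Dict]] = []
trainer = ToyTrainer(train_func, num_workers=2, callbacks=[seen.append])
results = trainer.fit()
```

In this sketch the user only writes ``train_func``; creating the executor and handling the results stay inside ``fit``, which mirrors the division of responsibilities described in the list.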
- .. _train-arch-backend:
- Backend
- -------
- Backends are used to initialize and manage framework-specific communication protocols.
- Each training library (Torch, Horovod, TensorFlow, etc.) has a separate backend,
- which takes specific configuration values defined in a :ref:`BackendConfig<train-backend-config>`.
- Each backend comes with a ``BackendExecutor`` that is used to run the training process.
- .. _train-arch-executor:
- Executor
- --------
- The executor is an interface (``BackendExecutor``) that executes distributed training.
- It handles the creation of a group of workers (using :ref:`Ray Actors<actor-guide>`)
- and is initialized with a :ref:`backend<train-arch-backend>`.
- The executor passes all required resources, the number of workers, and information about
- worker placement to the ``WorkerGroup``.
- WorkerGroup
- -----------
- The WorkerGroup is a generic utility class for managing a group of Ray Actors.
- This is similar in concept to Fiber's `Ring <https://uber.github.io/fiber/experimental/ring/>`_.
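As a rough illustration of what "managing a group of workers" can mean, the sketch below runs the same function on every member of a group and collects the results in rank order. It is a toy under stated assumptions: ``ToyWorker`` and ``ToyWorkerGroup`` are hypothetical classes, and threads from the standard library stand in for Ray Actors.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List


class ToyWorker:
    """Hypothetical worker that knows its own rank."""

    def __init__(self, rank: int) -> None:
        self.rank = rank

    def execute(self, func: Callable[..., Any], *args: Any) -> Any:
        # The worker passes its rank as the first argument to the function.
        return func(self.rank, *args)


class ToyWorkerGroup:
    """Hypothetical group utility: threads stand in for remote actors."""

    def __init__(self, num_workers: int) -> None:
        if num_workers < 1:
            raise ValueError("num_workers must be at least 1")
        self.workers = [ToyWorker(rank) for rank in range(num_workers)]

    def __len__(self) -> int:
        return len(self.workers)

    def execute(self, func: Callable[..., Any], *args: Any) -> List[Any]:
        # Run the function on every worker concurrently and return the
        # results ordered by worker rank.
        with ThreadPoolExecutor(max_workers=len(self.workers)) as pool:
            futures = [pool.submit(w.execute, func, *args) for w in self.workers]
            return [f.result() for f in futures]

    def execute_single(self, rank: int, func: Callable[..., Any], *args: Any) -> Any:
        # Run the function on one specific worker only.
        return self.workers[rank].execute(func, *args)


group = ToyWorkerGroup(num_workers=3)
squares = group.execute(lambda rank, offset: (rank + offset) ** 2, 1)
```

The group itself knows nothing about training; it only fans a callable out to its members, which is what makes such a utility generic.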