Tune Internals
==============

This page gives an overview of Tune's design and architecture and provides docstrings for internal components.

.. image:: ../../images/tune-arch.png

Blue boxes are internal components, and green boxes are public-facing.

Main Components
---------------

Tune's main components are TrialRunner, Trial objects, TrialExecutor, SearchAlg, TrialScheduler, and Trainable.

.. _trial-runner-flow:

This is an illustration of the high-level training flow and how some of the components interact:

*Note: This figure is horizontally scrollable*

.. figure:: ../../images/tune-trial-runner-flow-horizontal.png
    :class: horizontal-scroll

TrialRunner
~~~~~~~~~~~

[`source code <https://github.com/ray-project/ray/blob/master/python/ray/tune/trial_runner.py>`__]

This is the main driver of the training loop. This component
uses the TrialScheduler to prioritize and execute trials,
queries the SearchAlgorithm for new
configurations to evaluate, and handles the fault tolerance logic.

**Fault Tolerance**: The TrialRunner performs checkpointing if ``checkpoint_freq``
is set, and automatically restarts failed trials if ``max_failures`` is set.
For example, if a node is lost while a trial (specifically, the corresponding
Trainable of the trial) is still executing on that node and checkpointing
is enabled, the trial is reverted to a ``"PENDING"`` state and resumed
from the last available checkpoint when it runs again.

The TrialRunner is also in charge of checkpointing the entire experiment execution state
upon each loop iteration. This allows users to restart their experiment
in case of machine failure.

See the docstring at :ref:`trialrunner-docstring`.
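
The fault-tolerance behavior described above can be illustrated with a small, self-contained sketch that does not use Ray. ``ToyTrial``, ``run_with_recovery``, and the ``fail_at`` argument are hypothetical names invented for this illustration; only ``checkpoint_freq``, ``max_failures``, and the state names come from the text above. Marking a trial ``"ERRORED"`` when no checkpoint exists is a simplifying assumption of the sketch, not a statement about Tune's behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyTrial:
    """Hypothetical stand-in for a trial; not Tune's Trial class."""
    status: str = "PENDING"
    iteration: int = 0
    last_checkpoint: Optional[int] = None  # iteration stored in the last checkpoint
    num_failures: int = 0


def run_with_recovery(trial, total_iters, checkpoint_freq, max_failures, fail_at):
    """Run a toy trial, simulating one node loss at each iteration in `fail_at`."""
    pending_failures = set(fail_at)
    while trial.iteration < total_iters:
        trial.status = "RUNNING"
        next_iter = trial.iteration + 1
        if next_iter in pending_failures:
            pending_failures.discard(next_iter)
            trial.num_failures += 1
            # Assumption: give up if retries are exhausted or nothing was saved.
            if trial.num_failures > max_failures or trial.last_checkpoint is None:
                trial.status = "ERRORED"
                return trial
            # Revert to PENDING and resume from the last available checkpoint.
            trial.status = "PENDING"
            trial.iteration = trial.last_checkpoint
            continue
        trial.iteration = next_iter
        if checkpoint_freq and trial.iteration % checkpoint_freq == 0:
            trial.last_checkpoint = trial.iteration
    trial.status = "TERMINATED"
    return trial


trial = run_with_recovery(ToyTrial(), total_iters=10, checkpoint_freq=3,
                          max_failures=1, fail_at=[8])
print(trial.status, trial.iteration, trial.num_failures)
```

In this run the simulated failure at iteration 8 sends the trial back to the checkpoint taken at iteration 6, so iterations 7 and 8 are repeated before the trial finishes.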

Trial objects
~~~~~~~~~~~~~

[`source code <https://github.com/ray-project/ray/blob/master/python/ray/tune/trial.py>`__]

This is an internal data structure that contains metadata about each training run. Each Trial
object maps one-to-one to a Trainable object, but Trial objects are not themselves
distributed/remote. Trial objects transition among
the following states: ``"PENDING"``, ``"RUNNING"``, ``"PAUSED"``, ``"ERRORED"``, and
``"TERMINATED"``.

See the docstring at :ref:`trial-docstring`.
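
As a rough illustration of a local metadata record that moves among these states, consider the sketch below. ``TrialRecord`` and its fields are hypothetical and are not Tune's ``Trial`` class; only the five state names are taken from the text. The sketch checks that a state name is known but does not model which transitions Tune actually allows.

```python
from dataclasses import dataclass, field

# The five state names listed above.
TRIAL_STATES = ("PENDING", "RUNNING", "PAUSED", "ERRORED", "TERMINATED")


@dataclass
class TrialRecord:
    """Hypothetical local metadata record; not Tune's Trial class."""
    trial_id: str
    config: dict
    status: str = "PENDING"
    history: list = field(default_factory=list)  # previously held states

    def set_status(self, new_status):
        # Reject names outside the documented set of states.
        if new_status not in TRIAL_STATES:
            raise ValueError(f"unknown trial state: {new_status!r}")
        self.history.append(self.status)
        self.status = new_status


record = TrialRecord(trial_id="t0", config={"lr": 0.01})
for state in ("RUNNING", "PAUSED", "RUNNING", "TERMINATED"):
    record.set_status(state)
print(record.status, record.history)
```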

TrialExecutor
~~~~~~~~~~~~~

[`source code <https://github.com/ray-project/ray/blob/master/python/ray/tune/trial_executor.py>`__]

The TrialExecutor is a component that interacts with the underlying execution framework.
It also manages resources to ensure the cluster isn't overloaded. By default, the TrialExecutor uses Ray to execute trials.

See the docstring at :ref:`raytrialexecutor-docstring`.
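
The resource-management role can be sketched as simple bookkeeping: a trial starts only if its request fits within what is still free. ``ToyExecutor`` and its methods are hypothetical and track CPUs only; they are not the ``TrialExecutor`` interface.

```python
class ToyExecutor:
    """Hypothetical executor that tracks CPU reservations; not Tune's TrialExecutor."""

    def __init__(self, total_cpus):
        self.total_cpus = total_cpus
        self.committed = {}  # trial_id -> CPUs reserved for that trial

    def has_resources(self, cpus):
        return sum(self.committed.values()) + cpus <= self.total_cpus

    def start_trial(self, trial_id, cpus):
        """Reserve CPUs for the trial; return False if the request does not fit."""
        if not self.has_resources(cpus):
            return False
        self.committed[trial_id] = cpus
        return True

    def stop_trial(self, trial_id):
        # Release the reservation so that other trials can start.
        self.committed.pop(trial_id, None)


executor = ToyExecutor(total_cpus=4)
started = [executor.start_trial(f"t{i}", cpus=2) for i in range(3)]
executor.stop_trial("t0")
retry = executor.start_trial("t2", cpus=2)
print(started, retry)
```

The third trial is refused while the first two hold all four CPUs, and it starts once one of them has stopped.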

SearchAlg
~~~~~~~~~

[`source code <https://github.com/ray-project/ray/tree/master/python/ray/tune/suggest>`__]

The SearchAlgorithm is a user-provided object
that is used for querying new hyperparameter configurations to evaluate.

SearchAlgorithms will be notified every time a trial finishes
executing one training step (of ``train()``), every time a trial
errors, and every time a trial completes.
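
The three notifications described above might look like the following in a toy search algorithm. This is a sketch under assumptions: ``ToyRandomSearch`` and its method names are chosen for illustration and are not claimed to match the signatures of Tune's search algorithm classes.

```python
import random


class ToyRandomSearch:
    """Hypothetical search algorithm; method names are illustrative only."""

    def __init__(self, low, high, seed=0):
        self._rng = random.Random(seed)
        self._low = low
        self._high = high
        self.events = []  # (event, trial_id) pairs, in the order received

    def suggest(self, trial_id):
        """Return a new hyperparameter configuration to evaluate."""
        return {"lr": self._rng.uniform(self._low, self._high)}

    def on_trial_result(self, trial_id, result):
        # Called after each training step of a trial.
        self.events.append(("result", trial_id))

    def on_trial_error(self, trial_id):
        # Called when a trial errors.
        self.events.append(("error", trial_id))

    def on_trial_complete(self, trial_id, result):
        # Called when a trial completes.
        self.events.append(("complete", trial_id))


search = ToyRandomSearch(low=1e-4, high=1e-1)
config = search.suggest("t0")
for step in range(2):
    search.on_trial_result("t0", {"step": step, "loss": 1.0 / (step + 1)})
search.on_trial_complete("t0", {"step": 1, "loss": 0.5})
search.suggest("t1")
search.on_trial_error("t1")
print(search.events)
```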

TrialScheduler
~~~~~~~~~~~~~~

[`source code <https://github.com/ray-project/ray/blob/master/python/ray/tune/schedulers>`__]

TrialSchedulers operate over a set of possible trials to run,
prioritizing trial execution given available cluster resources.

TrialSchedulers can kill or pause trials,
and can also reorder/prioritize incoming trials.
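
A toy scheduler can illustrate both abilities: deciding whether a running trial should continue, and choosing which pending trial to run next given free resources. ``ToyThresholdScheduler``, its stopping rule, and the tuple layout for pending trials are assumptions made for this sketch and do not describe any scheduler shipped with Tune.

```python
CONTINUE, STOP = "CONTINUE", "STOP"


class ToyThresholdScheduler:
    """Hypothetical scheduler; not one of Tune's TrialScheduler classes."""

    def __init__(self, grace_iters, min_score):
        self.grace_iters = grace_iters
        self.min_score = min_score

    def on_trial_result(self, trial_id, iteration, score):
        # Kill trials that are still below the threshold after the grace period.
        if iteration >= self.grace_iters and score < self.min_score:
            return STOP
        return CONTINUE

    def choose_trial_to_run(self, pending, free_cpus):
        """Pick the highest-priority pending trial that fits in `free_cpus`.

        `pending` is a list of (trial_id, cpus_needed, priority) tuples.
        """
        runnable = [t for t in pending if t[1] <= free_cpus]
        if not runnable:
            return None
        return max(runnable, key=lambda t: t[2])[0]


scheduler = ToyThresholdScheduler(grace_iters=3, min_score=0.5)
decisions = [
    scheduler.on_trial_result("a", iteration=1, score=0.1),
    scheduler.on_trial_result("a", iteration=3, score=0.1),
    scheduler.on_trial_result("b", iteration=3, score=0.9),
]
pending = [("c", 4, 1), ("d", 2, 5), ("e", 2, 9)]
next_trial = scheduler.choose_trial_to_run(pending, free_cpus=2)
print(decisions, next_trial)
```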

Trainables
~~~~~~~~~~

[`source code <https://github.com/ray-project/ray/blob/master/python/ray/tune/trainable.py>`__]

These are user-provided objects that are used for
the training process. If a class is provided, it is expected to conform to the
Trainable interface. If a function is provided, it is wrapped into a
Trainable class, and the function itself is executed on a separate thread.

Trainables will execute one step of ``train()`` before notifying the TrialRunner.
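
One way to picture the function-wrapping behavior is a class that runs the user function on a background thread and hands back one reported result per ``train()`` call. This is a minimal sketch using only the standard library; ``ToyFunctionRunner`` and the ``report`` callback passed to the function are hypothetical and do not reproduce Tune's actual wrapper.

```python
import queue
import threading


class ToyFunctionRunner:
    """Hypothetical wrapper that gives a function a class-style train() method."""

    def __init__(self, train_func, config):
        self._results = queue.Queue(maxsize=1)
        self._resume = threading.Semaphore(0)
        self._thread = threading.Thread(
            target=self._run, args=(train_func, config), daemon=True)
        self._started = False
        self._done = False

    def _run(self, train_func, config):
        try:
            train_func(config, self._report)
        finally:
            self._results.put(None)  # sentinel: the function has returned

    def _report(self, **metrics):
        self._results.put(metrics)
        self._resume.acquire()  # block the function until the next train() call

    def train(self):
        """Advance the function by one reported step and return its metrics."""
        if self._done:
            return None
        if not self._started:
            self._started = True
            self._thread.start()
        else:
            self._resume.release()
        result = self._results.get()
        if result is None:
            self._done = True
        return result


def my_train_func(config, report):
    for i in range(3):
        report(iteration=i, loss=config["start"] - i)


runner = ToyFunctionRunner(my_train_func, {"start": 10})
results = [runner.train() for _ in range(4)]
print(results)
```

The fourth call returns ``None`` because the function has already reported its three steps and returned.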

.. _raytrialexecutor-docstring:

RayTrialExecutor
----------------

.. autoclass:: ray.tune.ray_trial_executor.RayTrialExecutor
    :show-inheritance:
    :members:

.. _trialexecutor-docstring:

TrialExecutor
-------------

.. autoclass:: ray.tune.trial_executor.TrialExecutor
    :members:

.. _trialrunner-docstring:

TrialRunner
-----------

.. autoclass:: ray.tune.trial_runner.TrialRunner

.. _trial-docstring:

Trial
-----

.. autoclass:: ray.tune.trial.Trial

.. _tune-callbacks-docs:

Callbacks
---------

.. autoclass:: ray.tune.callback.Callback
    :members:

.. _resources-docstring:

PlacementGroupFactory
---------------------

.. autoclass:: ray.tune.utils.placement_groups.PlacementGroupFactory

Registry
--------

.. autofunction:: ray.tune.register_trainable

.. autofunction:: ray.tune.register_env