Fault Tolerance
===============

This document describes how Ray handles machine and process failures.

Tasks
-----

When a worker is executing a task, if the worker dies unexpectedly, either
because the process crashed or because the machine failed, Ray will rerun
the task (after a delay of several seconds) until either the task succeeds
or the maximum number of retries is exceeded. The default number of retries
is 3.

You can experiment with this behavior by running the following code.
.. code-block:: python

    import numpy as np
    import os
    import ray
    import time

    ray.init(ignore_reinit_error=True)

    @ray.remote(max_retries=1)
    def potentially_fail(failure_probability):
        time.sleep(0.2)
        if np.random.random() < failure_probability:
            os._exit(0)
        return 0

    for _ in range(3):
        try:
            # If this task crashes, Ray will retry it up to one additional
            # time. If either of the attempts succeeds, the call to ray.get
            # below will return normally. Otherwise, it will raise an
            # exception.
            ray.get(potentially_fail.remote(0.5))
            print('SUCCESS')
        except ray.exceptions.WorkerCrashedError:
            print('FAILURE')
.. _actor-fault-tolerance:

Actors
------

Ray will automatically restart actors that crash unexpectedly.
This behavior is controlled using ``max_restarts``,
which sets the maximum number of times that an actor will be restarted.
If 0, the actor won't be restarted. If -1, it will be restarted indefinitely.
When an actor is restarted, its state will be recreated by rerunning its
constructor.
After the specified number of restarts, subsequent actor methods will
raise a ``RayActorError``.

You can experiment with this behavior by running the following code.
.. code-block:: python

    import os
    import ray
    import time

    ray.init(ignore_reinit_error=True)

    @ray.remote(max_restarts=5)
    class Actor:
        def __init__(self):
            self.counter = 0

        def increment_and_possibly_fail(self):
            self.counter += 1
            time.sleep(0.2)
            if self.counter == 10:
                os._exit(0)
            return self.counter

    actor = Actor.remote()

    # The actor will be restarted up to 5 times. After that, methods will
    # always raise a `RayActorError` exception. The actor is restarted by
    # rerunning its constructor. Methods that were sent or executing when the
    # actor died will also raise a `RayActorError` exception.
    for _ in range(100):
        try:
            counter = ray.get(actor.increment_and_possibly_fail.remote())
            print(counter)
        except ray.exceptions.RayActorError:
            print('FAILURE')
By default, actor tasks execute with at-most-once semantics
(``max_task_retries=0`` in the ``@ray.remote`` decorator). This means that if an
actor task is submitted to an actor that is unreachable, Ray will report the
error with ``RayActorError``, a Python-level exception that is thrown when
``ray.get`` is called on the future returned by the task. Note that this
exception may be thrown even though the task did indeed execute successfully.
For example, this can happen if the actor dies immediately after executing the
task.

Ray also offers at-least-once execution semantics for actor tasks
(``max_task_retries=-1`` or ``max_task_retries > 0``). This means that if an
actor task is submitted to an actor that is unreachable, the system will
automatically retry the task until it receives a reply from the actor. With
this option, the system will only throw a ``RayActorError`` to the application
if one of the following occurs: (1) the actor’s ``max_restarts`` limit has been
exceeded and the actor cannot be restarted anymore, or (2) the
``max_task_retries`` limit has been exceeded for this particular task. The
limit can be set to infinity with ``max_task_retries = -1``.

You can experiment with this behavior by running the following code.
.. code-block:: python

    import os
    import ray

    ray.init(ignore_reinit_error=True)

    @ray.remote(max_restarts=5, max_task_retries=-1)
    class Actor:
        def __init__(self):
            self.counter = 0

        def increment_and_possibly_fail(self):
            # Exit after every 10 tasks.
            if self.counter == 10:
                os._exit(0)
            self.counter += 1
            return self.counter

    actor = Actor.remote()

    # The actor will be reconstructed up to 5 times. The actor is
    # reconstructed by rerunning its constructor. Methods that were
    # executing when the actor died will be retried and will not
    # raise a `RayActorError`. Retried methods may execute twice, once
    # on the failed actor and a second time on the restarted actor.
    for _ in range(50):
        counter = ray.get(actor.increment_and_possibly_fail.remote())
        print(counter)  # Prints the sequence 1-10 5 times.

    # After the actor has been restarted 5 times, all subsequent methods will
    # raise a `RayActorError`.
    for _ in range(10):
        try:
            counter = ray.get(actor.increment_and_possibly_fail.remote())
            print(counter)  # Unreachable.
        except ray.exceptions.RayActorError:
            print('FAILURE')  # Prints 10 times.
For at-least-once actors, the system will still guarantee execution ordering
according to the initial submission order. For example, any tasks submitted
after a failed actor task will not execute on the actor until the failed actor
task has been successfully retried. The system will not attempt to re-execute
any tasks that executed successfully before the failure (unless
:ref:`object reconstruction <object-reconstruction>` is enabled).
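To make the ordering guarantee concrete, here is a minimal sketch (the actor,
the ``/tmp`` marker file used to inject a single crash, and the single-node
``ray.init`` session are all assumptions made for illustration):

.. code-block:: python

    import os
    import ray

    ray.init(ignore_reinit_error=True)

    MARKER = "/tmp/ray_ordering_crash_marker"  # hypothetical crash marker
    if os.path.exists(MARKER):
        os.remove(MARKER)

    @ray.remote(max_restarts=1, max_task_retries=-1)
    class OrderedActor:
        def process(self, value):
            # Inject exactly one crash, while processing value 3.
            if value == 3 and not os.path.exists(MARKER):
                open(MARKER, "w").close()
                os._exit(0)
            return value

    actor = OrderedActor.remote()
    # Submit all tasks up front. The crash on value 3 delays the later tasks,
    # but they still run in submission order once 3 has been retried.
    refs = [actor.process.remote(i) for i in range(5)]
    print(ray.get(refs))  # [0, 1, 2, 3, 4]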
At-least-once execution is best suited for read-only actors or actors with
ephemeral state that does not need to be rebuilt after a failure. For actors
that have critical state, it is best to take periodic checkpoints and either
manually restart the actor or automatically restart the actor with at-most-once
semantics. If the actor’s exact state at the time of failure is needed, the
application is responsible for resubmitting all tasks since the last
checkpoint.
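As a rough sketch of this checkpointing pattern (the checkpoint path, format,
and interval below are arbitrary choices for illustration, and the example
assumes a single-node cluster where the restarted actor can read the same
local file):

.. code-block:: python

    import json
    import os
    import ray

    ray.init(ignore_reinit_error=True)

    CHECKPOINT_PATH = "/tmp/counter_checkpoint.json"  # hypothetical location

    # Restart automatically, but keep the default at-most-once task semantics.
    @ray.remote(max_restarts=-1)
    class CheckpointedCounter:
        def __init__(self):
            # On every (re)start, restore the most recent checkpoint, if any.
            self.counter = 0
            if os.path.exists(CHECKPOINT_PATH):
                with open(CHECKPOINT_PATH) as f:
                    self.counter = json.load(f)["counter"]

        def increment(self):
            self.counter += 1
            # Checkpoint every 5 increments; how often to checkpoint is an
            # application-level decision.
            if self.counter % 5 == 0:
                with open(CHECKPOINT_PATH, "w") as f:
                    json.dump({"counter": self.counter}, f)
            return self.counter

    counter = CheckpointedCounter.remote()
    print(ray.get([counter.increment.remote() for _ in range(7)]))
    # If the actor crashes now, it restarts from the last checkpoint (5), and
    # the application is responsible for resubmitting the increments issued
    # after that checkpoint.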
.. note::

    For :ref:`async or threaded actors <async-actors>`, the tasks might
    be completed out of order. Upon actor restart, the system will only retry
    *incomplete* tasks, in their initial submission order. Previously completed
    tasks will not be re-executed.
.. _object-reconstruction:

Objects
-------

Task outputs over a configurable threshold (default 100KB) may be stored in
Ray's distributed object store. Thus, a node failure can cause the loss of a
task output. If this occurs, Ray will automatically attempt to recover the
value by looking for copies of the same object on other nodes. If there are no
other copies left, an ``ObjectLostError`` will be raised.

When there are no copies of an object left, Ray also provides an option to
automatically recover the value by re-executing the task that created the
value. Arguments to the task are recursively reconstructed with the same
method. This option can be enabled with
``ray.init(_enable_object_reconstruction=True)`` in standalone mode or ``ray
start --enable-object-reconstruction`` in cluster mode.

During reconstruction, each task will only be re-executed up to the specified
number of times, using ``max_retries`` for normal tasks and
``max_task_retries`` for actor tasks. Both limits can be set to infinity with
the value -1.
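For example, a minimal sketch of enabling reconstruction in standalone mode
(this assumes a fresh Ray session; on a single-node cluster the object cannot
actually be lost, so the snippet only shows how the options fit together):

.. code-block:: python

    import numpy as np
    import ray

    ray.init(_enable_object_reconstruction=True)

    # The ~1 MB output is above the default 100KB threshold, so it is stored
    # in the distributed object store. max_retries=-1 allows unlimited
    # re-executions if the stored copy is ever lost.
    @ray.remote(max_retries=-1)
    def large_output():
        return np.zeros(10**6, dtype=np.uint8)

    ref = large_output.remote()
    # If the node holding the only copy failed, this ray.get would trigger
    # re-execution of large_output instead of raising ObjectLostError.
    print(ray.get(ref).shape)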