Debugging and Profiling
=======================

Observing Ray Work
------------------

You can run ``ray stack`` to dump the stack traces of all Ray workers on
the current node. This requires ``py-spy`` to be installed. See the
`Troubleshooting page <troubleshooting.html>`_ for more details.
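As a minimal sketch, assuming ``pip`` is available on the node:

.. code-block:: bash

  # Install py-spy, which ``ray stack`` uses to sample the worker processes.
  pip install py-spy

  # Dump the current stack traces of all Ray workers on this node.
  ray stack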
Visualizing Tasks in the Ray Timeline
-------------------------------------

The most important tool is the timeline visualization tool. To visualize tasks
in the Ray timeline, you can dump the timeline as a JSON file by running ``ray
timeline`` from the command line or by using the following command.

.. code-block:: python

  ray.timeline(filename="/tmp/timeline.json")

Then open `chrome://tracing`_ in the Chrome web browser, and load
``timeline.json``.

.. _`chrome://tracing`: chrome://tracing
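For context, a minimal end-to-end sketch is shown below; the task body and the
output filename are placeholders:

.. code-block:: python

  import time
  import ray

  ray.init()

  @ray.remote
  def busy_work():
      # Placeholder task so the timeline has events to display.
      time.sleep(0.1)

  # Run a few tasks, then dump the collected events for chrome://tracing.
  ray.get([busy_work.remote() for _ in range(10)])
  ray.timeline(filename="/tmp/timeline.json")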
Profiling Using Python's cProfile
---------------------------------

Another way to profile the performance of your Ray application is to use
Python's native cProfile `profiling module`_. Rather than tracking your
application code line by line, cProfile can give the total runtime of each
loop function, as well as list the number of calls made and the execution
time of all function calls made within the profiled code.

.. _`profiling module`: https://docs.python.org/3/library/profile.html#module-cProfile

Unlike a line-by-line profiler such as ``line_profiler``, this detailed list of
profiled function calls **includes** internal function calls and function calls
made within Ray.

However, like ``line_profiler``, cProfile can be enabled with minimal changes
to your application code (given that each section of the code you want to
profile is defined as its own function). To use cProfile, add an import
statement, then replace calls to the loop functions as follows:
.. code-block:: python

  import cProfile  # Added import statement
  import time

  import ray

  # The example remote function being profiled: it simply sleeps for 0.5 seconds.
  @ray.remote
  def func():
      time.sleep(0.5)

  def ex1():
      list1 = []
      for i in range(5):
          list1.append(ray.get(func.remote()))

  def main():
      ray.init()
      cProfile.run('ex1()')  # Modified call to ex1
      cProfile.run('ex2()')  # ex2 and ex3 are the other example loop
      cProfile.run('ex3()')  # functions, which are not shown in this excerpt.

  if __name__ == "__main__":
      main()
Now, when executing your Python script, a cProfile list of profiled function
calls will be printed to the terminal for each call made to ``cProfile.run()``.
The very top of cProfile's output gives the total execution time for
``'ex1()'``:

.. code-block:: bash

  601 function calls (595 primitive calls) in 2.509 seconds

Following is a snippet of profiled function calls for ``'ex1()'``. Most of
these calls are quick and take around 0.000 seconds, so the functions of
interest are the ones with non-zero execution times:
.. code-block:: bash

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  ...
      1    0.000    0.000    2.509    2.509 your_script_here.py:31(ex1)
      5    0.000    0.000    0.001    0.000 remote_function.py:103(remote)
      5    0.000    0.000    0.001    0.000 remote_function.py:107(_remote)
  ...
     10    0.000    0.000    0.000    0.000 worker.py:2459(__init__)
      5    0.000    0.000    2.508    0.502 worker.py:2535(get)
      5    0.000    0.000    0.000    0.000 worker.py:2695(get_global_worker)
     10    0.000    0.000    2.507    0.251 worker.py:374(retrieve_and_deserialize)
      5    0.000    0.000    2.508    0.502 worker.py:424(get_object)
      5    0.000    0.000    0.000    0.000 worker.py:514(submit_task)
  ...
The 5 separate calls to Ray's ``get``, each taking the full 0.502 seconds, can
be seen at ``worker.py:2535(get)``. Meanwhile, the act of calling the remote
function itself at ``remote_function.py:103(remote)`` only takes 0.001 seconds
over 5 calls, and thus is not the source of the slow performance of ``ex1()``.
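As a rough sketch of the usual fix (the function name ``ex1_parallel`` is just
illustrative), submit all five tasks before calling ``ray.get`` so the 0.5
second sleeps overlap instead of running one after another:

.. code-block:: python

  def ex1_parallel():
      # Submit all five tasks first; each call returns an object ref immediately.
      refs = [func.remote() for _ in range(5)]
      # Block only once, after everything has been submitted,
      # so the tasks can run in parallel.
      return ray.get(refs)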
Profiling Ray Actors with cProfile
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Because cProfile's detailed output can look quite different depending on which
Ray functionality we use, let us see what cProfile's output might look like if
our example involved actors (for an introduction to Ray actors, see our
`Actor documentation here`_).

.. _`Actor documentation here`: http://docs.ray.io/en/master/actors.html
Now, instead of looping over five calls to a remote function like in ``ex1``,
let's create a new example and loop over five calls to a remote function
**inside an actor**. Our actor's remote function again just sleeps for 0.5
seconds:

.. code-block:: python

  # Our actor
  @ray.remote
  class Sleeper(object):
      def __init__(self):
          self.sleepValue = 0.5

      # Equivalent to func(), but defined within an actor
      def actor_func(self):
          time.sleep(self.sleepValue)
Recalling the suboptimality of ``ex1``, let's first see what happens if we
attempt to perform all five ``actor_func()`` calls within a single actor:

.. code-block:: python

  def ex4():
      # This is suboptimal in Ray, and should only be used for the sake of this example
      actor_example = Sleeper.remote()

      five_results = []
      for i in range(5):
          five_results.append(actor_example.actor_func.remote())

      # Wait until the end to call ray.get()
      ray.get(five_results)
We enable cProfile on this example as follows:

.. code-block:: python

  def main():
      ray.init()
      cProfile.run('ex4()')

  if __name__ == "__main__":
      main()
Running our new Actor example, cProfile's abbreviated output is as follows:

.. code-block:: bash

  12519 function calls (11956 primitive calls) in 2.525 seconds

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  ...
      1    0.000    0.000    0.015    0.015 actor.py:546(remote)
      1    0.000    0.000    0.015    0.015 actor.py:560(_remote)
      1    0.000    0.000    0.000    0.000 actor.py:697(__init__)
  ...
      1    0.000    0.000    2.525    2.525 your_script_here.py:63(ex4)
  ...
      9    0.000    0.000    0.000    0.000 worker.py:2459(__init__)
      1    0.000    0.000    2.509    2.509 worker.py:2535(get)
      9    0.000    0.000    0.000    0.000 worker.py:2695(get_global_worker)
      4    0.000    0.000    2.508    0.627 worker.py:374(retrieve_and_deserialize)
      1    0.000    0.000    2.509    2.509 worker.py:424(get_object)
      8    0.000    0.000    0.001    0.000 worker.py:514(submit_task)
  ...
It turns out that the entire example still took 2.5 seconds to execute, i.e.
the time for five calls to ``actor_func()`` to run in serial. We remember in
``ex1`` that this behavior was because we did not wait until after submitting
all five remote function tasks to call ``ray.get()``, but we can verify on
cProfile's output line ``worker.py:2535(get)`` that ``ray.get()`` was only
called once at the end, for 2.509 seconds. What happened?

It turns out Ray cannot parallelize this example, because we have only
initialized a single ``Sleeper`` actor. Because each actor is a single,
stateful worker, our entire code is submitted to and run on a single worker the
whole time.

To better parallelize the actors in ``ex4``, we can take advantage of the fact
that each call to ``actor_func()`` is independent, and instead create five
``Sleeper`` actors. That way, we create five workers that can run in parallel,
instead of a single worker that can only handle one call to ``actor_func()``
at a time.
.. code-block:: python

  def ex4():
      # Modified to create five separate Sleepers
      five_actors = [Sleeper.remote() for i in range(5)]

      # Each call to actor_func now goes to a different Sleeper
      five_results = []
      for actor_example in five_actors:
          five_results.append(actor_example.actor_func.remote())

      ray.get(five_results)
Our example in total now takes only 1.5 seconds to run:

.. code-block:: bash

  1378 function calls (1363 primitive calls) in 1.567 seconds

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  ...
      5    0.000    0.000    0.002    0.000 actor.py:546(remote)
      5    0.000    0.000    0.002    0.000 actor.py:560(_remote)
      5    0.000    0.000    0.000    0.000 actor.py:697(__init__)
  ...
      1    0.000    0.000    1.566    1.566 your_script_here.py:71(ex4)
  ...
     21    0.000    0.000    0.000    0.000 worker.py:2459(__init__)
      1    0.000    0.000    1.564    1.564 worker.py:2535(get)
     25    0.000    0.000    0.000    0.000 worker.py:2695(get_global_worker)
      3    0.000    0.000    1.564    0.521 worker.py:374(retrieve_and_deserialize)
      1    0.000    0.000    1.564    1.564 worker.py:424(get_object)
     20    0.001    0.000    0.001    0.000 worker.py:514(submit_task)
  ...
Troubleshooting
===============

This section discusses some common problems that people run into when using
Ray, as well as some known issues. If you encounter other problems, please
`let us know`_.

.. _`let us know`: https://github.com/ray-project/ray/issues
Crashes
-------

If Ray crashes, you may wonder what happened. Currently, this can occur for
some of the following reasons.

- **Stressful workloads:** Workloads that create many tasks in a short
  amount of time can sometimes interfere with the heartbeat mechanism that we
  use to check that processes are still alive. On the head node in the cluster,
  you can check the files ``/tmp/ray/session_*/logs/monitor*`` (see the sketch
  after this list). They will indicate which processes Ray has marked as dead
  (due to a lack of heartbeats). However, it is currently possible for a
  process to get marked as dead without actually having died.

- **Starting many actors:** Workloads that start a large number of actors all at
  once may exhibit problems when the processes (or libraries that they use)
  contend for resources. Similarly, a script that starts many actors over the
  lifetime of the application will eventually cause the system to run out of
  file descriptors. This is addressable, but currently we do not garbage collect
  actor processes until the script finishes.

- **Running out of file descriptors:** As a workaround, you may be able to
  increase the maximum number of file descriptors with a command like
  ``ulimit -n 65536``. If that fails, double check that the hard limit is
  sufficiently large by running ``ulimit -Hn``. If it is too small, you can
  increase the hard limit as follows (these instructions work on EC2).

  * Increase the hard ulimit for open file descriptors system-wide by running
    the following.

    .. code-block:: bash

      sudo bash -c "echo $USER hard nofile 65536 >> /etc/security/limits.conf"

  * Log out and log back in.
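As a quick sketch of inspecting the monitor logs mentioned in the first bullet
(the exact log wording varies between Ray versions, so treat the ``grep``
pattern as a starting point rather than an exact match):

.. code-block:: bash

  # Show the most recent monitor log entries on the head node.
  tail -n 100 /tmp/ray/session_*/logs/monitor*

  # Look for lines about processes being marked as dead.
  grep -i "dead" /tmp/ray/session_*/logs/monitor*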
No Speedup
----------

You just ran an application using Ray, but it wasn't as fast as you expected it
to be. Or worse, perhaps it was slower than the serial version of the
application! The most common reasons are the following.

- **Number of cores:** How many cores is Ray using? When you start Ray, it will
  determine the number of CPUs on each machine with ``psutil.cpu_count()``. Ray
  usually will not schedule more tasks in parallel than the number of CPUs. So
  if the number of CPUs is 4, the most you should expect is a 4x speedup.

- **Physical versus logical CPUs:** Do the machines you're running on have fewer
  **physical** cores than **logical** cores? You can check the number of logical
  cores with ``psutil.cpu_count()`` and the number of physical cores with
  ``psutil.cpu_count(logical=False)``. This is common on a lot of machines and
  especially on EC2. For many workloads (especially numerical workloads), you
  often cannot expect a greater speedup than the number of physical CPUs.

- **Small tasks:** Are your tasks very small? Ray introduces some overhead for
  each task (the amount of overhead depends on the arguments that are passed
  in). You will be unlikely to see speedups if your tasks take less than ten
  milliseconds. For many workloads, you can easily increase the sizes of your
  tasks by batching them together.

- **Variable durations:** Do your tasks have variable duration? If you run 10
  tasks with variable duration in parallel, you shouldn't expect an N-fold
  speedup (because you'll end up waiting for the slowest task). In this case,
  consider using ``ray.wait`` to begin processing tasks that finish first, as
  sketched after this list.

- **Multi-threaded libraries:** Are all of your tasks attempting to use all of
  the cores on the machine? If so, they are likely to experience contention and
  prevent your application from achieving a speedup. This is very common with
  some versions of ``numpy``, and in that case contention can usually be avoided
  by setting an environment variable like ``MKL_NUM_THREADS`` (or the equivalent
  depending on your installation) to ``1``.

  For many - but not all - libraries, you can diagnose this by opening ``top``
  while your application is running. If one process is using most of the CPUs,
  and the others are using a small amount, this may be the problem. The most
  common exception is PyTorch, which will appear to be using all the cores
  despite needing ``torch.set_num_threads(1)`` to be called to avoid contention.
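A minimal sketch of the ``ray.wait`` pattern mentioned in the variable-durations
bullet; the task body is a placeholder:

.. code-block:: python

  import random
  import time

  import ray

  ray.init()

  @ray.remote
  def variable_task():
      # Placeholder work with a variable duration.
      time.sleep(random.uniform(0.1, 1.0))
      return 1

  remaining = [variable_task.remote() for _ in range(10)]
  while remaining:
      # ray.wait returns as soon as at least one task has finished,
      # so results can be processed in completion order.
      done, remaining = ray.wait(remaining, num_returns=1)
      result = ray.get(done[0])
      # ... process result here ...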
If you are still experiencing a slowdown, but none of the above problems apply,
we'd really like to know! Please create a `GitHub issue`_ and consider
submitting a minimal code example that demonstrates the problem.

.. _`GitHub issue`: https://github.com/ray-project/ray/issues
Outdated Function Definitions
-----------------------------

Due to subtleties of Python, if you redefine a remote function, you may not
always get the expected behavior. In this case, it may be that Ray is not
running the newest version of the function.

Suppose you define a remote function ``f`` and then redefine it. Ray should use
the newest version.

.. code-block:: python

  @ray.remote
  def f():
      return 1

  @ray.remote
  def f():
      return 2

  ray.get(f.remote())  # This should be 2.

However, the following are cases where modifying the remote function will
not update Ray to the new version (at least without stopping and restarting
Ray).
- **The function is imported from an external file:** In this case,
  ``f`` is defined in some external file ``file.py``. If you ``import file``,
  change the definition of ``f`` in ``file.py``, then re-``import file``,
  the function ``f`` will not be updated.

  This is because the second import gets ignored as a no-op, so ``f`` is
  still defined by the first import.

  A solution to this problem is to use ``reload(file)`` instead of a second
  ``import file``. Reloading causes the new definition of ``f`` to be
  re-executed, and exports it to the other machines. Note that in Python 3, you
  need to do ``from importlib import reload``.
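  As a minimal sketch on the driver (assuming Ray has already been initialized
  and ``f`` is the remote function defined in ``file.py``):

  .. code-block:: python

      from importlib import reload  # Needed in Python 3.

      import file

      # ... edit the definition of f in file.py ...

      reload(file)  # Re-executes file.py, picking up the new definition of f.
      ray.get(file.f.remote())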
- **The function relies on a helper function from an external file:**
  In this case, ``f`` can be defined within your Ray application, but relies
  on a helper function ``h`` defined in some external file ``file.py``. If the
  definition of ``h`` gets changed in ``file.py``, redefining ``f`` will not
  update Ray to use the new version of ``h``.

  This is because when ``f`` first gets defined, its definition is shipped to
  all of the workers, and is unpickled. During unpickling, ``file.py`` gets
  imported in the workers. Then when ``f`` gets redefined, its definition is
  again shipped and unpickled in all of the workers. But since ``file.py``
  has been imported in the workers already, it is treated as a second import
  and is ignored as a no-op.

  Unfortunately, reloading on the driver does not update ``h``, as the reload
  needs to happen on the worker.

  A solution to this problem is to redefine ``f`` to reload ``file.py`` before
  it calls ``h``. For example, if inside ``file.py`` you have

  .. code-block:: python

      def h():
          return 1

  and you define the remote function ``f`` as

  .. code-block:: python

      @ray.remote
      def f():
          return file.h()

  you can redefine ``f`` as follows.

  .. code-block:: python

      @ray.remote
      def f():
          reload(file)
          return file.h()

  This forces the reload to happen on the workers as needed. Note that in
  Python 3, you need to do ``from importlib import reload``.