.. _tune-scalability:

Scalability and overhead benchmarks
===================================

We conducted a series of microbenchmarks to evaluate the scalability of Ray Tune and to analyze the
performance overhead we observed. The results from these benchmarks are reflected in the documentation,
e.g. when we make suggestions on :ref:`how to remove performance bottlenecks <tune-bottlenecks>`.

This page gives an overview of the experiments we ran. For each experiment, the goal was to
examine the total runtime and to address issues whenever the observed overhead compared to the
minimal theoretical time was too high (e.g. more than 20% overhead).

In some of the experiments we tweaked the default settings for maximum throughput, e.g. by disabling
trial synchronization or result logging. Where this is the case, it is stated in the respective benchmark
description.

.. list-table:: Ray Tune scalability benchmarks overview
   :header-rows: 1

   * - Variable
     - # of trials
     - Results/second/trial
     - # of nodes
     - # CPUs/node
     - Trial length (s)
     - Observed runtime (s)
   * - `Trial bookkeeping/scheduling overhead <https://github.com/ray-project/ray/blob/master/release/tune_tests/scalability_tests/workloads/test_bookkeeping_overhead.py>`_
     - 10,000
     - 1
     - 1
     - 16
     - 1
     - | 715.27
       | (625 minimum)
   * - `Result throughput (many trials) <https://github.com/ray-project/ray/blob/master/release/tune_tests/scalability_tests/workloads/test_result_throughput_cluster.py>`_
     - 1,000
     - 0.1
     - 16
     - 64
     - 100
     - 168.18
   * - `Result throughput (many results) <https://github.com/ray-project/ray/blob/master/release/tune_tests/scalability_tests/workloads/test_result_throughput_single_node.py>`_
     - 96
     - 10
     - 1
     - 96
     - 100
     - 168.94
   * - `Network communication overhead <https://github.com/ray-project/ray/blob/master/release/tune_tests/scalability_tests/workloads/test_network_overhead.py>`_
     - 200
     - 1
     - 200
     - 2
     - 300
     - 2280.82
   * - `Long running, 3.75 GB checkpoints <https://github.com/ray-project/ray/blob/master/release/tune_tests/scalability_tests/workloads/test_long_running_large_checkpoints.py>`_
     - 16
     - | Results: 1/60
       | Checkpoint: 1/900
     - 1
     - 16
     - 86,400
     - 88687.41
   * - `XGBoost parameter sweep <https://github.com/ray-project/ray/blob/master/release/tune_tests/scalability_tests/workloads/test_xgboost_sweep.py>`_
     - 16
     - ?
     - 16
     - 64
     - ?
     - 3903
   * - `Durable trainable <https://github.com/ray-project/ray/blob/master/release/tune_tests/scalability_tests/workloads/test_durable_trainable.py>`_
     - 16
     - | 10/60
       | with 10 MB checkpoints
     - 16
     - 2
     - 300
     - 392.42

Below we discuss some insights into the results where we observed significant overhead.

Result throughput
-----------------

Result throughput describes the number of results Ray Tune can process in a given timeframe
(e.g. "results per second"). The higher the throughput, the more concurrent results can be processed
without major delays.

Result throughput is limited by the time it takes to process results. When a trial reports results, it only
continues training once the trial executor has re-triggered the remote training function. If many trials
report results at the same time, each subsequent remote training call is only triggered after that trial's
results have been handled.

To speed this process up, Ray Tune adaptively buffers results, so that trial training is continued earlier
when many trials are running in parallel and report many results at the same time. Still, processing hundreds
of results per trial for dozens or hundreds of trials can become a bottleneck.

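The buffering behavior can usually be adjusted through environment variables. Below is a minimal sketch,
assuming the ``TUNE_RESULT_BUFFER_LENGTH`` and ``TUNE_RESULT_BUFFER_MAX_TIME_S`` variables; their exact
names and availability depend on your Ray version, so check the Tune environment variable documentation:

.. code-block:: python

    import os

    # Assumed environment variables controlling Tune's adaptive result buffering.
    # They have to be set before the Tune experiment is started.
    os.environ["TUNE_RESULT_BUFFER_LENGTH"] = "1000"    # buffer up to 1,000 results per trial
    os.environ["TUNE_RESULT_BUFFER_MAX_TIME_S"] = "10"  # flush buffered results at least every 10 seconds
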
**Main insight**: Ray Tune will throw a warning when trial result processing becomes a bottleneck. If you
notice that this becomes a problem, please follow the guidelines outlined :ref:`in the FAQ <tune-bottlenecks>`.
Generally, it is advisable not to report too many results at the same time. Consider increasing the reporting
interval by a factor of 5-10x.

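As an illustration of increasing the reporting interval, a function trainable can aggregate metrics locally
and only report every N-th step. In the sketch below, ``run_one_training_step`` is a hypothetical placeholder
for your own training logic, and the factor of 10 is just an example value:

.. code-block:: python

    from ray import tune

    def train_fn(config):
        losses = []
        for step in range(1000):
            loss = run_one_training_step(config)  # hypothetical placeholder for your training code
            losses.append(loss)
            # Report only every 10th step instead of every step, reducing the
            # number of results Ray Tune has to process by a factor of 10.
            if (step + 1) % 10 == 0:
                tune.report(loss=sum(losses) / len(losses))
                losses = []
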
Below we present more detailed results on result throughput performance.

Many concurrent trials
""""""""""""""""""""""

In this setup, loggers (CSV, JSON, and TensorboardX) and trial synchronization are disabled, except where
explicitly noted.

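For reference, disabling result loggers and trial synchronization as in this benchmark could look roughly
like the sketch below. The exact arguments (e.g. ``loggers`` and ``tune.SyncConfig``) have changed between
Ray versions, and ``train_fn`` is a placeholder trainable, so treat this as a sketch rather than a drop-in
snippet:

.. code-block:: python

    from ray import tune

    tune.run(
        train_fn,
        num_samples=1000,
        # Assumption: an empty list disables the default CSV/JSON/TensorboardX loggers.
        loggers=[],
        # Assumption: passing ``syncer=None`` disables syncing results/checkpoints to the driver.
        sync_config=tune.SyncConfig(syncer=None),
    )
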
In this experiment, we run many concurrent trials (up to 1,000) on a cluster and adjust the
reporting frequency (number of results per second per trial) to probe the throughput limits.

Around 500 total results per second appears to be the threshold for acceptable performance
when logging and synchronization are disabled. With logging enabled, around 50-100 results per second
can still be handled without too much overhead; beyond that, measures to reduce the number of incoming
results should be considered.

+-------------+--------------------------+---------+---------------+--------------+----------------------+
| # of trials | Results / second / trial | # Nodes | # CPUs / Node | Trial length | Observed runtime (s) |
+=============+==========================+=========+===============+==============+======================+
| 1,000       | 10                       | 16      | 64            | 100s         | 248.39               |
+-------------+--------------------------+---------+---------------+--------------+----------------------+
| 1,000       | 1                        | 16      | 64            | 100s         | 175.00               |
+-------------+--------------------------+---------+---------------+--------------+----------------------+
| 1,000       | 0.1 with logging         | 16      | 64            | 100s         | 168.18               |
+-------------+--------------------------+---------+---------------+--------------+----------------------+
| 384         | 10                       | 16      | 64            | 100s         | 125.17               |
+-------------+--------------------------+---------+---------------+--------------+----------------------+
| 256         | 50                       | 16      | 64            | 100s         | 307.02               |
+-------------+--------------------------+---------+---------------+--------------+----------------------+
| 256         | 20                       | 16      | 64            | 100s         | 146.20               |
+-------------+--------------------------+---------+---------------+--------------+----------------------+
| 256         | 10                       | 16      | 64            | 100s         | 113.40               |
+-------------+--------------------------+---------+---------------+--------------+----------------------+
| 256         | 10 with logging          | 16      | 64            | 100s         | 436.12               |
+-------------+--------------------------+---------+---------------+--------------+----------------------+
| 256         | 0.1 with logging         | 16      | 64            | 100s         | 106.75               |
+-------------+--------------------------+---------+---------------+--------------+----------------------+

Many results on a single node
"""""""""""""""""""""""""""""

In this setup, loggers (CSV, JSON, and TensorboardX) are disabled, except where explicitly noted.

In this experiment, we run 96 concurrent trials on a single node and adjust the reporting frequency
(number of results per second per trial) to find the throughput limits. Compared to the cluster setup,
we report much more often, as we are running fewer trials in parallel overall.

On a single node, throughput appears to be somewhat higher. With logging enabled, handling around 1,000
total results per second still seems acceptable in terms of overhead, though you should probably still
aim for a lower number.

+-------------+--------------------------+---------+---------------+--------------+----------------------+
| # of trials | Results / second / trial | # Nodes | # CPUs / Node | Trial length | Observed runtime (s) |
+=============+==========================+=========+===============+==============+======================+
| 96          | 500                      | 1       | 96            | 100s         | 959.32               |
+-------------+--------------------------+---------+---------------+--------------+----------------------+
| 96          | 100                      | 1       | 96            | 100s         | 219.48               |
+-------------+--------------------------+---------+---------------+--------------+----------------------+
| 96          | 80                       | 1       | 96            | 100s         | 197.15               |
+-------------+--------------------------+---------+---------------+--------------+----------------------+
| 96          | 50                       | 1       | 96            | 100s         | 110.55               |
+-------------+--------------------------+---------+---------------+--------------+----------------------+
| 96          | 50 with logging          | 1       | 96            | 100s         | 702.64               |
+-------------+--------------------------+---------+---------------+--------------+----------------------+
| 96          | 10                       | 1       | 96            | 100s         | 103.51               |
+-------------+--------------------------+---------+---------------+--------------+----------------------+
| 96          | 10 with logging          | 1       | 96            | 100s         | 168.94               |
+-------------+--------------------------+---------+---------------+--------------+----------------------+

Network overhead
----------------

Running Ray Tune in a distributed setup leads to network communication overhead. This is mostly due to
trial synchronization, where results and checkpoints are periodically synchronized and sent over the network.
By default this happens via SSH, where connection initialization can take between 1 and 2 seconds each time.
Since this is a blocking operation that happens on a per-trial basis, running many concurrent trials
quickly becomes bottlenecked by this synchronization.

In this experiment, we ran a number of trials on a cluster, with each trial running on a separate node. We
varied the number of concurrent trials (and nodes) to see how much network communication affects
total runtime.

**Main insight**: When running many concurrent trials in a distributed setup, consider using
:ref:`ray.tune.durable <tune-durable-trainable>` for checkpoint synchronization instead. Another option is
to use shared storage and to disable syncing to the driver. The best practices are described
:ref:`here for Kubernetes setups <tune-kubernetes>` but are applicable to any kind of setup.

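As a minimal sketch of what this could look like, assuming the ``tune.durable`` wrapper and
``tune.SyncConfig(upload_dir=...)`` API of the Ray versions these benchmarks were run with
(``train_fn`` and the S3 bucket URL below are placeholders):

.. code-block:: python

    from ray import tune

    tune.run(
        # Wrap the trainable so checkpoints are persisted to cloud storage
        # instead of being synced to the driver via SSH.
        tune.durable(train_fn),
        num_samples=200,
        sync_config=tune.SyncConfig(
            upload_dir="s3://your-bucket/tune-results",  # placeholder bucket URL
        ),
    )
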
In the table below we present more detailed results on the network communication overhead.

+-------------+--------------------------+---------+---------------+--------------+----------------------+
| # of trials | Results / second / trial | # Nodes | # CPUs / Node | Trial length | Observed runtime (s) |
+=============+==========================+=========+===============+==============+======================+
| 200         | 1                        | 200     | 2             | 300s         | 2280.82              |
+-------------+--------------------------+---------+---------------+--------------+----------------------+
| 100         | 1                        | 100     | 2             | 300s         | 1470                 |
+-------------+--------------------------+---------+---------------+--------------+----------------------+
| 100         | 0.01                     | 100     | 2             | 300s         | 473.41               |
+-------------+--------------------------+---------+---------------+--------------+----------------------+
| 50          | 1                        | 50      | 2             | 300s         | 474.30               |
+-------------+--------------------------+---------+---------------+--------------+----------------------+
| 50          | 0.1                      | 50      | 2             | 300s         | 441.54               |
+-------------+--------------------------+---------+---------------+--------------+----------------------+
| 10          | 1                        | 10      | 2             | 300s         | 334.37               |
+-------------+--------------------------+---------+---------------+--------------+----------------------+