
Best Practices: Ray with TensorFlow
===================================

This document describes best practices for using the Ray core APIs with TensorFlow. Ray also provides higher-level utilities for working with TensorFlow, such as distributed training APIs (`training tensorflow example`_), Tune for hyperparameter search (:doc:`/tune/examples/tf_mnist_example`), and RLlib for reinforcement learning (`RLlib tensorflow example`_).

.. _`training tensorflow example`: tf_distributed_training.html
.. _`RLlib tensorflow example`: rllib-models.html#tensorflow-models

Feel free to contribute if you think this document is missing anything.

Common Issues: Pickling
-----------------------

One common issue with TensorFlow 2.0 is a pickling error like the following:

.. code-block:: none

    File "/home/***/venv/lib/python3.6/site-packages/ray/actor.py", line 322, in remote
      return self._remote(args=args, kwargs=kwargs)
    File "/home/***/venv/lib/python3.6/site-packages/ray/actor.py", line 405, in _remote
      self._modified_class, self._actor_method_names)
    File "/home/***/venv/lib/python3.6/site-packages/ray/function_manager.py", line 578, in export_actor_class
      "class": pickle.dumps(Class),
    File "/home/***/venv/lib/python3.6/site-packages/ray/cloudpickle/cloudpickle.py", line 1123, in dumps
      cp.dump(obj)
    File "/home/***/lib/python3.6/site-packages/ray/cloudpickle/cloudpickle.py", line 482, in dump
      return Pickler.dump(self, obj)
    File "/usr/lib/python3.6/pickle.py", line 409, in dump
      self.save(obj)
    File "/usr/lib/python3.6/pickle.py", line 476, in save
      f(self, obj) # Call unbound method with explicit self
    File "/usr/lib/python3.6/pickle.py", line 751, in save_tuple
      save(element)
    File "/usr/lib/python3.6/pickle.py", line 808, in _batch_appends
      save(tmp[0])
    File "/usr/lib/python3.6/pickle.py", line 496, in save
      rv = reduce(self.proto)
    TypeError: can't pickle _LazyLoader objects

To resolve this, you should move all instances of ``import tensorflow`` into the Ray actor or function, as follows:

.. code-block:: python

    def create_model():
        import tensorflow as tf
        ...

This issue is caused by side effects of importing TensorFlow and setting global state.

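For example, a remote task that builds a model can defer the import until the task runs on the worker. The sketch below is illustrative only; the task name, the Keras layers, and the input shape are placeholders, not part of the Ray API:

.. code-block:: python

    import ray

    ray.init()

    @ray.remote
    def train_step(weights=None):
        # Importing TensorFlow inside the task means the module is loaded in
        # the worker process instead of being captured (and pickled) with the
        # function definition.
        import tensorflow as tf

        model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
        model.build(input_shape=(None, 4))
        if weights is not None:
            model.set_weights(weights)
        # ... run a training step here ...
        return model.get_weights()

    new_weights = ray.get(train_step.remote())
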
Use Actors for Parallel Models
------------------------------

If you are training a deep network in the distributed setting, you may need to
ship your deep network between processes (or machines). However, shipping the model is not always straightforward.

.. tip::

    Avoid sending the TensorFlow model directly. A straightforward attempt to pickle a TensorFlow graph gives mixed results. Furthermore, creating a TensorFlow graph can take tens of seconds, so serializing a graph and recreating it in another process is inefficient.

It is recommended to replicate the same TensorFlow graph on each worker once
at the beginning and then to ship only the weights between the workers.

Suppose we have a simple network definition (this one is modified from the
TensorFlow documentation).

.. literalinclude:: ../examples/doc_code/tf_example.py
    :language: python
    :start-after: __tf_model_start__
    :end-before: __tf_model_end__

It is strongly recommended that you create actors to handle this. To do this, first initialize
Ray and define an actor class:

.. literalinclude:: ../examples/doc_code/tf_example.py
    :language: python
    :start-after: __ray_start__
    :end-before: __ray_end__

Then, we can instantiate this actor and train it in a separate process:

.. literalinclude:: ../examples/doc_code/tf_example.py
    :language: python
    :start-after: __actor_start__
    :end-before: __actor_end__

We can then use ``set_weights`` and ``get_weights`` to move the weights of the neural network
around. This allows us to manipulate weights between different models running in parallel without shipping the actual TensorFlow graphs, which are much more complex Python objects.

.. literalinclude:: ../examples/doc_code/tf_example.py
    :language: python
    :start-after: __weight_average_start__

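For reference, the weight-averaging pattern looks roughly like the following. This is only a sketch; it assumes an actor class ``Network`` such as the one defined in the example above, whose ``get_weights`` returns a dictionary of NumPy arrays:

.. code-block:: python

    # Create two actors, each holding its own copy of the graph.
    net1 = Network.remote()
    net2 = Network.remote()

    # Fetch the weights from both actors in parallel.
    weights1, weights2 = ray.get(
        [net1.get_weights.remote(), net2.get_weights.remote()])

    # Average the weights key by key and ship the result back to both actors.
    averaged = {name: (weights1[name] + weights2[name]) / 2 for name in weights1}
    ray.get([net1.set_weights.remote(averaged), net2.set_weights.remote(averaged)])
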
Lower-level TF Utilities
------------------------

Given a low-level TF definition:

.. code-block:: python

    import tensorflow as tf
    import numpy as np

    x_data = tf.placeholder(tf.float32, shape=[100])
    y_data = tf.placeholder(tf.float32, shape=[100])

    w = tf.Variable(tf.random_uniform([1], -1.0, 1.0))
    b = tf.Variable(tf.zeros([1]))
    y = w * x_data + b

    loss = tf.reduce_mean(tf.square(y - y_data))
    optimizer = tf.train.GradientDescentOptimizer(0.5)
    grads = optimizer.compute_gradients(loss)
    train = optimizer.apply_gradients(grads)

    init = tf.global_variables_initializer()
    sess = tf.Session()

To extract and set the weights, you can use the following helper class.

.. code-block:: python

    import ray.experimental.tf_utils

    variables = ray.experimental.tf_utils.TensorFlowVariables(loss, sess)

The ``TensorFlowVariables`` object provides methods for getting and setting the
weights as well as collecting all of the variables in the model.

Now we can use these methods to extract the weights and place them back in the
network as follows.

.. code-block:: python

    sess = tf.Session()
    # First initialize the weights.
    sess.run(init)
    # Get the weights
    weights = variables.get_weights()  # Returns a dictionary of numpy arrays
    # Set the weights
    variables.set_weights(weights)

**Note:** If we were to set the weights using the ``assign`` method like below,
each call to ``assign`` would add a node to the graph, and the graph would grow
unmanageably large over time.

.. code-block:: python

    w.assign(np.zeros(1))  # This adds a node to the graph every time you call it.
    b.assign(np.zeros(1))  # This adds a node to the graph every time you call it.

.. autoclass:: ray.experimental.tf_utils.TensorFlowVariables
    :members:

.. note:: This may not work with ``tf.Keras``.

Troubleshooting
~~~~~~~~~~~~~~~

Note that ``TensorFlowVariables`` uses variable names to determine what
variables to set when calling ``set_weights``. One common issue arises when two
networks are defined in the same TensorFlow graph. In this case, TensorFlow
appends an underscore and integer to the names of variables to disambiguate
them. This will cause ``TensorFlowVariables`` to fail. For example, if we have a
class definition ``Network`` with a ``TensorFlowVariables`` instance:

.. code-block:: python

    import ray
    import ray.experimental.tf_utils
    import tensorflow as tf

    class Network(object):
        def __init__(self):
            a = tf.Variable(1)
            b = tf.Variable(1)
            c = tf.add(a, b)
            sess = tf.Session()
            init = tf.global_variables_initializer()
            sess.run(init)
            self.variables = ray.experimental.tf_utils.TensorFlowVariables(c, sess)

        def set_weights(self, weights):
            self.variables.set_weights(weights)

        def get_weights(self):
            return self.variables.get_weights()

and run the following code:

.. code-block:: python

    a = Network()
    b = Network()
    b.set_weights(a.get_weights())

the code would fail. If we instead defined each network in its own TensorFlow
graph, then it would work:

.. code-block:: python

    with tf.Graph().as_default():
        a = Network()

    with tf.Graph().as_default():
        b = Network()

    b.set_weights(a.get_weights())

This issue does not occur between actors that contain a network, as each actor
is in its own process, and thus is in its own graph. This also does not occur
when using ``set_flat``.

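This is because ``get_flat`` and ``set_flat`` move the weights as a single flat NumPy array rather than as a name-keyed dictionary, so variable names never come into play. A minimal sketch, reusing the ``Network`` class from above:

.. code-block:: python

    a = Network()
    b = Network()

    # get_flat concatenates all of the variable values into one 1-D numpy
    # array; set_flat assigns from such an array, ignoring variable names.
    flat_weights = a.variables.get_flat()
    b.variables.set_flat(flat_weights)
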
Another issue to keep in mind is that ``TensorFlowVariables`` needs to add new
operations to the graph. If you close the graph and make it immutable, e.g. by
creating a ``MonitoredTrainingSession``, the initialization will fail. To resolve
this, simply create the instance before you close the graph.

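A rough sketch of that ordering is shown below. It assumes that ``TensorFlowVariables`` can be constructed before a session exists and attached to one later with ``set_session``:

.. code-block:: python

    import ray.experimental.tf_utils
    import tensorflow as tf

    # Build the graph first (any small model will do for illustration).
    x = tf.placeholder(tf.float32, shape=[None, 1])
    w = tf.Variable(tf.zeros([1, 1]))
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))

    # Create the TensorFlowVariables instance while the graph is still
    # mutable; it adds the operations it needs at construction time.
    variables = ray.experimental.tf_utils.TensorFlowVariables(loss)

    # MonitoredTrainingSession finalizes the graph, so create it afterwards
    # and attach it to the existing TensorFlowVariables instance.
    with tf.train.MonitoredTrainingSession() as sess:
        variables.set_session(sess)
        weights = variables.get_weights()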