.. _cluster-cloud:

Launching Cloud Clusters
========================

This section provides instructions for configuring the Ray Cluster Launcher to use with AWS/Azure/GCP, an existing Kubernetes cluster, or a private cluster of host machines.

See this blog post for a `step by step guide`_ to using the Ray Cluster Launcher.

.. _`step by step guide`: https://medium.com/distributed-computing-with-ray/a-step-by-step-guide-to-scaling-your-first-python-application-in-the-cloud-8761fe331ef1

.. _ref-cloud-setup:

AWS/GCP/Azure
-------------

.. toctree::
    :hidden:

    /cluster/aws-tips.rst

.. tabs::

    .. group-tab:: AWS

        First, install boto (``pip install boto3``) and configure your AWS credentials in ``~/.aws/credentials``,
        as described in `the boto docs <http://boto3.readthedocs.io/en/latest/guide/configuration.html>`__.
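
        If you want to sanity-check the credentials before launching anything, one option (a minimal sketch, not required by the cluster launcher) is to call the STS ``get_caller_identity`` API through boto3:

        .. code-block:: python

            import boto3

            # If this prints your account ID, boto3 can see valid AWS credentials.
            print(boto3.client("sts").get_caller_identity()["Account"])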

        Once boto is configured to manage resources on your AWS account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/aws/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/aws/example-full.yaml>`__ cluster config file will create a small cluster with an m5.large head node (on-demand) configured to autoscale up to two m5.large `spot workers <https://aws.amazon.com/ec2/spot/>`__.

        Test that it works by running the following commands from your local machine:

        .. code-block:: bash

            # Create or update the cluster. When the command finishes, it will print
            # out the command that can be used to SSH into the cluster head node.
            $ ray up ray/python/ray/autoscaler/aws/example-full.yaml

            # Get a remote screen on the head node.
            $ ray attach ray/python/ray/autoscaler/aws/example-full.yaml
            $ # Try running a Ray program with 'ray.init(address="auto")'.

            # Tear down the cluster.
            $ ray down ray/python/ray/autoscaler/aws/example-full.yaml

        See :ref:`aws-cluster` for recipes on customizing AWS clusters.

    .. group-tab:: Azure

        First, install the Azure CLI (``pip install azure-cli``), then log in using ``az login``.
        Set the subscription to use from the command line (``az account set -s <subscription_id>``) or by modifying the provider section of the provided config, e.g. ``ray/python/ray/autoscaler/azure/example-full.yaml``.

        Once the Azure CLI is configured to manage resources on your Azure account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/azure/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/azure/example-full.yaml>`__ cluster config file will create a small cluster with a Standard DS2v3 head node (on-demand) configured to autoscale up to two Standard DS2v3 `spot workers <https://docs.microsoft.com/en-us/azure/virtual-machines/windows/spot-vms>`__. Note that you'll need to fill in your resource group and location in those templates.

        Test that it works by running the following commands from your local machine:

        .. code-block:: bash

            # Create or update the cluster. When the command finishes, it will print
            # out the command that can be used to SSH into the cluster head node.
            $ ray up ray/python/ray/autoscaler/azure/example-full.yaml

            # Get a remote screen on the head node.
            $ ray attach ray/python/ray/autoscaler/azure/example-full.yaml

            # test ray setup
            $ python -c 'import ray; ray.init(address="auto")'
            $ exit

            # Tear down the cluster.
            $ ray down ray/python/ray/autoscaler/azure/example-full.yaml

        **Azure Portal**:
        Alternatively, you can deploy a cluster using the Azure portal directly. Please note that autoscaling is done using Azure VM Scale Sets and not through
        the Ray autoscaler. This will deploy `Azure Data Science VMs (DSVM) <https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/>`_
        for both the head node and the auto-scalable cluster managed by `Azure Virtual Machine Scale Sets <https://azure.microsoft.com/en-us/services/virtual-machine-scale-sets/>`_.
        The head node conveniently exposes both SSH and JupyterLab.

        .. image:: https://aka.ms/deploytoazurebutton
            :target: https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fray-project%2Fray%2Fmaster%2Fdoc%2Fazure%2Fazure-ray-template.json
            :alt: Deploy to Azure

        Once the template is successfully deployed, the deployment's Outputs page provides the SSH command to connect and the link to JupyterHub on the head node (username/password as specified in the template input).

        Use the following code in a Jupyter notebook (using the conda environment specified in the template input, ``py37_tensorflow`` by default) to connect to the Ray cluster.

        .. code-block:: python

            import ray
            ray.init(address='auto')
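
        To confirm from the notebook that worker nodes have joined, one quick optional check is to print the cluster's aggregate resources:

        .. code-block:: python

            # Reports the combined CPUs, GPUs, and memory of all nodes currently in the cluster.
            print(ray.cluster_resources())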

        Note that on each node the `azure-init.sh <https://github.com/ray-project/ray/blob/master/doc/azure/azure-init.sh>`_ script is executed and performs the following actions:

        1. Activates one of the conda environments available on DSVM
        2. Installs Ray and any other user-specified dependencies
        3. Sets up a systemd task (``/lib/systemd/system/ray.service``) to start Ray in head or worker mode

    .. group-tab:: GCP

        First, install the Google API client (``pip install google-api-python-client``), set up your GCP credentials, and create a new GCP project.
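
        One optional way to sanity-check that client libraries can see your credentials (a minimal sketch, assuming you use application-default credentials) is:

        .. code-block:: python

            import google.auth

            # Returns the active credentials and the project they default to (may be None).
            credentials, project_id = google.auth.default()
            print(project_id)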

        Once the API client is configured to manage resources on your GCP account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/gcp/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/gcp/example-full.yaml>`__ cluster config file will create a small cluster with an n1-standard-2 head node (on-demand) configured to autoscale up to two n1-standard-2 `preemptible workers <https://cloud.google.com/preemptible-vms/>`__. Note that you'll need to fill in your project ID in those templates.

        Test that it works by running the following commands from your local machine:

        .. code-block:: bash

            # Create or update the cluster. When the command finishes, it will print
            # out the command that can be used to SSH into the cluster head node.
            $ ray up ray/python/ray/autoscaler/gcp/example-full.yaml

            # Get a remote screen on the head node.
            $ ray attach ray/python/ray/autoscaler/gcp/example-full.yaml
            $ # Try running a Ray program with 'ray.init(address="auto")'.

            # Tear down the cluster.
            $ ray down ray/python/ray/autoscaler/gcp/example-full.yaml

    .. group-tab:: Custom

        Ray also supports external node providers (see the `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__ interface).
        You can specify the external node provider using the YAML config:

        .. code-block:: yaml

            provider:
                type: external
                module: mypackage.myclass

        The module needs to be in the format ``package.provider_class`` or ``package.sub_package.provider_class``.
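
        To give a feel for what such a provider looks like, here is a heavily abbreviated, hypothetical sketch. The method names follow the ``NodeProvider`` interface linked above; consult ``node_provider.py`` for the authoritative signatures and the full set of required methods.

        .. code-block:: python

            from ray.autoscaler.node_provider import NodeProvider


            class MyNodeProvider(NodeProvider):
                """The class referenced by ``module:`` in the YAML above."""

                def __init__(self, provider_config, cluster_name):
                    super().__init__(provider_config, cluster_name)
                    # Set up a client for your infrastructure's API here.

                def create_node(self, node_config, tags, count):
                    # Launch ``count`` new machines and record their tags.
                    raise NotImplementedError

                def terminate_node(self, node_id):
                    # Shut down the machine identified by ``node_id``.
                    raise NotImplementedError

                def non_terminated_nodes(self, tag_filters):
                    # Return the IDs of all live nodes matching ``tag_filters``.
                    raise NotImplementedError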

.. _ray-launch-k8s:

Kubernetes
----------

The cluster launcher can also be used to start Ray clusters on an existing Kubernetes cluster.

.. tabs::

    .. group-tab:: Kubernetes

        First, install the Kubernetes API client (``pip install kubernetes``), then make sure your Kubernetes credentials are set up properly to access the cluster (if a command like ``kubectl get pods`` succeeds, you should be good to go).
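
        As an optional sanity check from Python (it simply mirrors ``kubectl get pods`` through the client library you just installed):

        .. code-block:: python

            from kubernetes import client, config

            # Uses the same kubeconfig and credentials as kubectl.
            config.load_kube_config()
            pods = client.CoreV1Api().list_pod_for_all_namespaces()
            print(len(pods.items), "pods visible")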

        Once you have ``kubectl`` configured locally to access the remote cluster, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/kubernetes/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/kubernetes/example-full.yaml>`__ cluster config file will create a small cluster of one pod for the head node configured to autoscale up to two worker node pods, with all pods requiring 1 CPU and 0.5GiB of memory.

        It's also possible to deploy service and ingress resources for each scaled worker pod. An example is provided in `ray/python/ray/autoscaler/kubernetes/example-ingress.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/kubernetes/example-ingress.yaml>`__.

        Test that it works by running the following commands from your local machine:

        .. code-block:: bash

            # Create or update the cluster. When the command finishes, it will print
            # out the command that can be used to get a remote shell into the head node.
            $ ray up ray/python/ray/autoscaler/kubernetes/example-full.yaml

            # List the pods running in the cluster. You should only see one head node
            # until you start running an application, at which point worker nodes
            # should be started. Don't forget to include the Ray namespace in your
            # 'kubectl' commands ('ray' by default).
            $ kubectl -n ray get pods

            # Get a remote screen on the head node.
            $ ray attach ray/python/ray/autoscaler/kubernetes/example-full.yaml
            $ # Try running a Ray program with 'ray.init(address="auto")'.

            # Tear down the cluster
            $ ray down ray/python/ray/autoscaler/kubernetes/example-full.yaml

        .. tip:: This section describes the easiest way to launch a Ray cluster on Kubernetes. See this :ref:`document for advanced usage <ray-k8s-deploy>` of Kubernetes with Ray.

        .. tip:: If you would like to use Ray Tune in your Kubernetes cluster, have a look at :ref:`this short guide to make it work <tune-kubernetes>`.

    .. group-tab:: Staroid (contributed)

        First, install the staroid client package (``pip install staroid``), then get an `access token <https://staroid.com/settings/accesstokens>`_.

        Once you have an access token, you should be ready to launch your cluster.
        The provided `ray/python/ray/autoscaler/staroid/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/staroid/example-full.yaml>`__ cluster config file will create a cluster with

        - a Jupyter notebook running on the head node
          (Staroid management console -> Kubernetes -> ``<your_ske_name>`` -> ``<ray_cluster_name>`` -> click "notebook")
        - a shared NFS volume mounted under the ``/nfs`` directory on all Ray nodes.

        Test that it works by running the following commands from your local machine:

        .. code-block:: bash

            # Configure the access token through an environment variable.
            $ export STAROID_ACCESS_TOKEN=<your access token>

            # Create or update the cluster. When the command finishes,
            # you can attach a screen to the head node.
            $ ray up ray/python/ray/autoscaler/staroid/example-full.yaml

            # Get a remote screen on the head node.
            $ ray attach ray/python/ray/autoscaler/staroid/example-full.yaml
            $ # Try running a Ray program with 'ray.init(address="auto")'.

            # Tear down the cluster.
            $ ray down ray/python/ray/autoscaler/staroid/example-full.yaml

.. _cluster-private-setup:

Local On Premise Cluster (List of nodes)
----------------------------------------

You would use this mode if you want to run distributed Ray applications on some local nodes available on premise.

The preferred way to run a Ray cluster on a private cluster of hosts is via the Ray Cluster Launcher.

There are two ways of running private clusters:

- Manually managed, i.e., the user explicitly specifies the head and worker IPs.
- Automatically managed, i.e., the user only specifies a coordinator address to a coordinating server that automatically coordinates its head and worker IPs.

.. tip:: To avoid getting a password prompt when running private clusters, make sure to set up your SSH keys on the private cluster as follows:

    .. code-block:: bash

        $ ssh-keygen
        $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

.. tabs::

    .. group-tab:: Manually Managed

        You can get started by filling out the fields in the provided `ray/python/ray/autoscaler/local/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/local/example-full.yaml>`__.
        Be sure to specify the proper ``head_ip``, list of ``worker_ips``, and the ``ssh_user`` field.

        Test that it works by running the following commands from your local machine:

        .. code-block:: bash

            # Create or update the cluster. When the command finishes, it will print
            # out the command that can be used to get a remote shell into the head node.
            $ ray up ray/python/ray/autoscaler/local/example-full.yaml

            # Get a remote screen on the head node.
            $ ray attach ray/python/ray/autoscaler/local/example-full.yaml
            $ # Try running a Ray program with 'ray.init(address="auto")'.

            # Tear down the cluster
            $ ray down ray/python/ray/autoscaler/local/example-full.yaml

    .. group-tab:: Automatically Managed

        Start by launching the coordinator server that will manage all the on-prem clusters. This server also isolates resources between different users. The script for running the coordinator server is `ray/python/ray/autoscaler/local/coordinator_server.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/local/coordinator_server.py>`__. To launch the coordinator server run:

        .. code-block:: bash

            $ python coordinator_server.py --ips <list_of_node_ips> --port <PORT>

        where ``<list_of_node_ips>`` is a comma-separated list of all the available nodes on the private cluster (for example, ``160.24.42.48,160.24.42.49,...``) and ``<PORT>`` is the port that the coordinator server will listen on.

        On startup, the coordinator server prints its own address. For example:

        .. code-block:: bash

            >> INFO:ray.autoscaler.local.coordinator_server:Running on prem coordinator server
               on address <Host:PORT>

        Next, instead of specific head/worker IPs, specify the ``<Host:PORT>`` printed above in the ``coordinator_address`` entry of the provided `ray/python/ray/autoscaler/local/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/local/example-full.yaml>`__.

        Now test that it works by running the following commands from your local machine:

        .. code-block:: bash

            # Create or update the cluster. When the command finishes, it will print
            # out the command that can be used to get a remote shell into the head node.
            $ ray up ray/python/ray/autoscaler/local/example-full.yaml

            # Get a remote screen on the head node.
            $ ray attach ray/python/ray/autoscaler/local/example-full.yaml
            $ # Try running a Ray program with 'ray.init(address="auto")'.

            # Tear down the cluster
            $ ray down ray/python/ray/autoscaler/local/example-full.yaml

.. _manual-cluster:

Manual Ray Cluster Setup
------------------------

The preferred way to run a Ray cluster is via the Ray Cluster Launcher. However, it is also possible to start a Ray cluster by hand.

This section assumes that you have a list of machines and that the nodes in the cluster can communicate with each other. It also assumes that Ray is installed
on each machine. To install Ray, follow the `installation instructions`_.

.. _`installation instructions`: http://docs.ray.io/en/master/installation.html

Starting Ray on each machine
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On the head node (just choose some node to be the head node), run the following.
If the ``--port`` argument is omitted, Ray will choose port 6379, falling back to a
random port.

.. code-block:: bash

    $ ray start --head --port=6379
    ...
    Next steps
      To connect to this Ray runtime from another node, run
        ray start --address='<ip address>:6379' --redis-password='<password>'

      If connection fails, check your firewall settings and network configuration.

The command will print out the address of the Redis server that was started
(the local node IP address plus the port number you specified).

**Then on each of the other nodes**, run the following. Make sure to replace
``<address>`` with the value printed by the command on the head node (it
should look something like ``123.45.67.89:6379``).

Note that if your compute nodes are on their own subnetwork with Network
Address Translation, the command printed by the head node will not work when
connecting from a machine outside that subnetwork. You need to find the
address that will reach the head node from the second machine. If the head node
has a domain address like compute04.berkeley.edu, you can simply use that in
place of an IP address and rely on DNS.

.. code-block:: bash

    $ ray start --address=<address> --redis-password='<password>'
    --------------------
    Ray runtime started.
    --------------------

    To terminate the Ray runtime, run
      ray stop

If you wish to specify that a machine has 10 CPUs and 1 GPU, you can do this
with the flags ``--num-cpus=10`` and ``--num-gpus=1``. See the :ref:`Configuration <configuring-ray>` page for more information.
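
Resources declared this way are what ``@ray.remote`` resource requests are scheduled against. As a minimal sketch (it assumes the cluster above is running and that at least one node was started with a GPU):

.. code-block:: python

    import ray

    ray.init(address="auto")

    # This task will only be scheduled on a node that advertises at least one GPU.
    @ray.remote(num_gpus=1)
    def which_node():
        return ray._private.services.get_node_ip_address()

    print(ray.get(which_node.remote()))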

If you see ``Unable to connect to Redis. If the Redis instance is on a
different machine, check that your firewall is configured properly.``,
this means the ``--port`` is inaccessible at the given IP address (because, for
example, the head node is not actually running Ray, or you have the wrong IP
address).

If you see ``Ray runtime started.``, then the node successfully connected to
the IP address at the ``--port``. You should now be able to connect to the
cluster with ``ray.init(address='auto')``.
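
For example, from any node that has joined the cluster (a quick optional check; ``ray.nodes()`` lists the nodes Ray currently knows about):

.. code-block:: python

    import ray

    ray.init(address="auto")
    # One entry per node that has successfully joined the cluster.
    print(ray.nodes())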

If ``ray.init(address='auto')`` keeps repeating
``redis_context.cc:303: Failed to connect to Redis, retrying.``, then the node
is failing to connect to some other port(s) besides the main port.

.. code-block:: bash

    If connection fails, check your firewall settings and network configuration.

If the connection fails, you can check whether each port can be reached from a
node with a tool such as ``nmap`` or ``nc``.

.. code-block:: bash

    $ nmap -sV --reason -p $PORT $HEAD_ADDRESS
    Nmap scan report for compute04.berkeley.edu (123.456.78.910)
    Host is up, received echo-reply ttl 60 (0.00087s latency).
    rDNS record for 123.456.78.910: compute04.berkeley.edu
    PORT     STATE SERVICE REASON         VERSION
    6379/tcp open  redis   syn-ack ttl 60 Redis key-value store
    Service detection performed. Please report any incorrect results at https://nmap.org/submit/ .

    $ nc -vv -z $HEAD_ADDRESS $PORT
    Connection to compute04.berkeley.edu 6379 port [tcp/*] succeeded!

If the node cannot access that port at that IP address, you might see

.. code-block:: bash

    $ nmap -sV --reason -p $PORT $HEAD_ADDRESS
    Nmap scan report for compute04.berkeley.edu (123.456.78.910)
    Host is up (0.0011s latency).
    rDNS record for 123.456.78.910: compute04.berkeley.edu
    PORT     STATE  SERVICE REASON       VERSION
    6379/tcp closed redis   reset ttl 60
    Service detection performed. Please report any incorrect results at https://nmap.org/submit/ .

    $ nc -vv -z $HEAD_ADDRESS $PORT
    nc: connect to compute04.berkeley.edu port 6379 (tcp) failed: Connection refused

Stopping Ray
~~~~~~~~~~~~

When you want to stop the Ray processes, run ``ray stop`` on each node.

Additional Cloud Providers
--------------------------

To use Ray autoscaling on other Cloud providers or cluster management systems, you can implement the ``NodeProvider`` interface (100 LOC) and register it in `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__. Contributions are welcome!

Security
--------

On cloud providers, nodes will be launched into their own security group by default, with traffic allowed only between nodes in the same group. A new SSH key will also be created and saved to your local machine for access to the cluster.

.. _using-ray-on-a-cluster:

Running a Ray program on the Ray cluster
----------------------------------------

To run a distributed Ray program, you'll need to execute your program on the same machine as one of the nodes.

.. tabs::

    .. group-tab:: Python

        Within your program/script, you must call ``ray.init`` with the ``address`` parameter (like ``ray.init(address=...)``). This causes Ray to connect to the existing cluster. For example:

        .. code-block:: python

            ray.init(address="auto")

    .. group-tab:: Java

        You need to add the ``ray.address`` parameter to your command line (like ``-Dray.address=...``).

        To connect your program to the Ray cluster, run it like this:

        .. code-block:: bash

            java -classpath <classpath> \
                -Dray.address=<address> \
                <classname> <args>

        .. note:: Specifying ``auto`` as the address hasn't been implemented in Java yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command.

.. note:: A common mistake is setting the address to be a cluster node while running the script on your laptop. This will not work because the script needs to be started/executed on one of the Ray nodes.

To verify that the correct number of nodes have joined the cluster, you can run the following.

.. code-block:: python

    import time

    import ray

    @ray.remote
    def f():
        time.sleep(0.01)
        return ray._private.services.get_node_ip_address()

    # Get a list of the IP addresses of the nodes that have joined the cluster.
    set(ray.get([f.remote() for _ in range(1000)]))

What's Next?
------------

Now that you have a working understanding of the cluster launcher, check out:

* :ref:`ref-cluster-quick-start`: An end-to-end demo of running an application that autoscales.
* :ref:`cluster-config`: A complete reference on how to configure your Ray cluster.
* :ref:`cluster-commands`: A short user guide to the various cluster launcher commands.

Questions or Issues?
--------------------

.. include:: /_help.rst