***********************
Deploying on Kubernetes
***********************

.. _ray-k8s-deploy:

Introduction
============

You can leverage your Kubernetes cluster as a substrate for execution of distributed Ray programs.
The Ray Autoscaler spins up and deletes Kubernetes pods according to resource demands of the Ray workload - each Ray node runs in its own Kubernetes pod.

Quick Guide
-----------

This document covers the following topics:

- :ref:`Overview of methods for launching a Ray Cluster on Kubernetes<k8s-overview>`
- :ref:`Managing clusters with the Ray Cluster Launcher<k8s-cluster-launcher>`
- :ref:`Managing clusters with the Ray Kubernetes Operator<k8s-operator>`
- :ref:`Interacting with a Ray Cluster via a Kubernetes Service<ray-k8s-interact>`
- :ref:`Comparison of the Ray Cluster Launcher and Ray Kubernetes Operator<k8s-comparison>`

You can find more information at the following links:

- :ref:`GPU usage with Kubernetes<k8s-gpus>`
- :ref:`Using Ray Tune on your Kubernetes cluster<tune-kubernetes>`
- :ref:`How to manually set up a non-autoscaling Ray cluster on Kubernetes<ray-k8s-static>`

.. _k8s-overview:

Ray on Kubernetes
=================

Ray supports two ways of launching an autoscaling Ray cluster on Kubernetes:

- Using the :ref:`Ray Cluster Launcher <k8s-cluster-launcher>`
- Using the :ref:`Ray Kubernetes Operator <k8s-operator>`

The Cluster Launcher and Ray Kubernetes Operator provide similar functionality; each serves as an `interface to the Ray autoscaler`.
Below is a brief overview of the two tools.

The Ray Cluster Launcher
------------------------

The :ref:`Ray Cluster Launcher <cluster-cloud>` is geared towards experimentation and development and can be used to launch Ray clusters on Kubernetes (among other backends).
It allows you to manage an autoscaling Ray Cluster from your local environment using the :ref:`Ray CLI <cluster-commands>`.
For example, you can use ``ray up`` to launch a Ray cluster on Kubernetes and ``ray exec`` to execute commands in the Ray head node's pod.
Note that using the Cluster Launcher requires Ray to be :ref:`installed locally <installation>`.

* Get started with the :ref:`Ray Cluster Launcher on Kubernetes<k8s-cluster-launcher>`.

The Ray Kubernetes Operator
---------------------------

The Ray Kubernetes Operator is a Kubernetes-native solution geared towards production use cases.
Rather than being handled locally, cluster launching and autoscaling are centralized in the Operator's pod.
The Operator follows the standard Kubernetes `pattern <https://kubernetes.io/docs/concepts/extend-kubernetes/operator/>`__ - it runs
a control loop which manages a `Kubernetes Custom Resource`_ specifying the desired state of your Ray cluster.
Using the Kubernetes Operator does not require a local installation of Ray - all interactions with your Ray cluster are mediated by Kubernetes.

* Get started with the :ref:`Ray Kubernetes Operator<k8s-operator>`.

Further reading
---------------

Read :ref:`here<k8s-comparison>` for more details on the comparison between the Operator and Cluster Launcher.
Note that it is also possible to manually deploy a :ref:`non-autoscaling Ray cluster <ray-k8s-static>` on Kubernetes.

.. note::

    The configuration ``yaml`` files used in this document are provided in the `Ray repository`_
    as examples to get you started. When deploying real applications, you will probably
    want to build and use your own container images, add more worker nodes to the
    cluster, and change the resource requests for the head and worker nodes. Refer to the provided ``yaml``
    files to be sure that you maintain important configuration options for Ray to
    function properly.

.. _`Ray repository`: https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/kubernetes

.. _k8s-cluster-launcher:

Managing Clusters with the Ray Cluster Launcher
===============================================

This section briefly explains how to use the Ray Cluster Launcher to launch a Ray cluster on your existing Kubernetes cluster.

First, install the Kubernetes API client (``pip install kubernetes``), then make sure your Kubernetes credentials are set up properly to access the cluster (if a command like ``kubectl get pods`` succeeds, you should be good to go).
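
If you prefer to check connectivity from Python, the following sanity test is a minimal sketch that simply mirrors ``kubectl get pods`` using the same ``kubernetes`` client library and kubeconfig credentials the Cluster Launcher relies on:

.. code-block:: python

    # Minimal connectivity check (sketch): lists a few pods using your local
    # kubeconfig, the same credentials `kubectl` uses.
    from kubernetes import client, config

    config.load_kube_config()  # Loads credentials from ~/.kube/config.
    core_api = client.CoreV1Api()
    pods = core_api.list_pod_for_all_namespaces(limit=5)
    print([pod.metadata.name for pod in pods.items])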

Once you have ``kubectl`` configured locally to access the remote cluster, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/kubernetes/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/kubernetes/example-full.yaml>`__ cluster config file will create a small cluster with one head node pod, configured to autoscale up to two worker node pods, with all pods requiring 1 CPU and 0.5GiB of memory.

Test that it works by running the following commands from your local machine:

.. _cluster-launcher-commands:

.. code-block:: bash

    # Create or update the cluster. When the command finishes, it will print
    # out the command that can be used to get a remote shell into the head node.
    $ ray up ray/python/ray/autoscaler/kubernetes/example-full.yaml

    # List the pods running in the cluster. You should only see one head node
    # until you start running an application, at which point worker nodes
    # should be started. Don't forget to include the Ray namespace in your
    # 'kubectl' commands ('ray' by default).
    $ kubectl -n ray get pods

    # Get a remote screen on the head node.
    $ ray attach ray/python/ray/autoscaler/kubernetes/example-full.yaml
    $ # Try running a Ray program with 'ray.init(address="auto")'.

    # View monitor logs.
    $ ray monitor ray/python/ray/autoscaler/kubernetes/example-full.yaml

    # Tear down the cluster.
    $ ray down ray/python/ray/autoscaler/kubernetes/example-full.yaml
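
Once attached to the head node, you can verify the cluster with a short program that connects via ``ray.init(address="auto")``. Here is a minimal sketch (a toy example, not one of the scripts shipped in the Ray repository):

.. code-block:: python

    # Toy verification script (sketch): run this inside the head node pod after
    # `ray attach`. It connects to the running cluster and executes trivial tasks.
    import ray

    ray.init(address="auto")

    @ray.remote
    def hello(i):
        return f"hello from task {i}"

    print(ray.get([hello.remote(i) for i in range(4)]))
    print("Cluster resources:", ray.cluster_resources())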

* Learn about :ref:`running Ray programs on Kubernetes <ray-k8s-run>`

.. _k8s-operator:

Managing clusters with the Ray Kubernetes Operator
==================================================

.. role:: bash(code)
   :language: bash

This section explains how to use the Ray Kubernetes Operator to launch a Ray cluster on your existing Kubernetes cluster.

The example commands in this document launch six Kubernetes pods, using a total of 6 CPU and 3.5Gi memory.
If you are experimenting using a test Kubernetes environment such as `minikube`_, make sure to provision sufficient resources, e.g.
:bash:`minikube start --cpus=6 --memory=\"4G\"`.
Alternatively, reduce resource usage by editing the ``yaml`` files referenced in this document; for example, reduce ``minWorkers``
in ``example_cluster.yaml`` and ``example_cluster2.yaml``.

.. note::

    1. The Ray Kubernetes Operator is still experimental. For the yaml files in the examples below, we recommend using the latest master version of Ray.
    2. The Ray Kubernetes Operator requires Kubernetes version at least ``v1.17.0``. Check Kubernetes version info with the command :bash:`kubectl version`.

Applying the RayCluster Custom Resource Definition
--------------------------------------------------

The Ray Kubernetes Operator works by managing a user-submitted `Kubernetes Custom Resource`_ (CR) called a ``RayCluster``.
A RayCluster custom resource describes the desired state of the Ray cluster.
To get started, we need to apply the `Kubernetes Custom Resource Definition`_ (CRD) defining a RayCluster.

.. code-block:: shell

    $ kubectl apply -f ray/python/ray/autoscaler/kubernetes/operator_configs/cluster_crd.yaml
    customresourcedefinition.apiextensions.k8s.io/rayclusters.cluster.ray.io created

.. note::

    The file ``cluster_crd.yaml`` defining the CRD is not meant to be modified by the user. Rather, users :ref:`configure <operator-launch>` a RayCluster CR via a file like `ray/python/ray/autoscaler/kubernetes/operator_configs/example_cluster.yaml <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/kubernetes/operator_configs/example_cluster.yaml>`__.
    The Kubernetes API server then validates the user-submitted RayCluster resource against the CRD.

Picking a Kubernetes Namespace
------------------------------

The rest of the Kubernetes resources we will use are `namespaced`_.
You can use an existing namespace for your Ray clusters or create a new one if you have permissions.
For this example, we will create a namespace called ``ray``.

.. code-block:: shell

    $ kubectl create namespace ray
    namespace/ray created

Starting the Operator
---------------------

To launch the operator in our namespace, we execute the following command.

.. code-block:: shell

    $ kubectl -n ray apply -f ray/python/ray/autoscaler/kubernetes/operator_configs/operator.yaml
    serviceaccount/ray-operator-serviceaccount created
    role.rbac.authorization.k8s.io/ray-operator-role created
    rolebinding.rbac.authorization.k8s.io/ray-operator-rolebinding created
    pod/ray-operator-pod created

The output shows that we've launched a pod named ``ray-operator-pod``. This is the pod that runs the operator process.
The ServiceAccount, Role, and RoleBinding we have created grant the operator pod the `permissions`_ it needs to manage Ray clusters.

.. _operator-launch:

Launching Ray Clusters
----------------------

Finally, to launch a Ray cluster, we create a RayCluster custom resource.

.. code-block:: shell

    $ kubectl -n ray apply -f ray/python/ray/autoscaler/kubernetes/operator_configs/example_cluster.yaml
    raycluster.cluster.ray.io/example-cluster created

The operator detects the RayCluster resource we've created and launches an autoscaling Ray cluster.
Our RayCluster configuration specifies ``minWorkers: 2`` in the second entry of ``spec.podTypes``, so we get a head node and two workers upon launch.

.. note::

    For more details about RayCluster resources, we recommend taking a look at the annotated example `example_cluster.yaml <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/kubernetes/operator_configs/example_cluster.yaml>`__ applied in the last command.

.. code-block:: shell

    $ kubectl -n ray get pods
    NAME                               READY   STATUS    RESTARTS   AGE
    example-cluster-ray-head-hbxvv     1/1     Running   0          72s
    example-cluster-ray-worker-4hvv6   1/1     Running   0          64s
    example-cluster-ray-worker-78kp5   1/1     Running   0          64s
    ray-operator-pod                   1/1     Running   0          2m33s

We see four pods: the operator, the Ray head node, and two Ray worker nodes.

Let's launch another cluster in the same namespace, this one specifying ``minWorkers: 1``.

.. code-block:: shell

    $ kubectl -n ray apply -f ray/python/ray/autoscaler/kubernetes/operator_configs/example_cluster2.yaml

We confirm that both clusters are running in our namespace.

.. code-block:: shell

    $ kubectl -n ray get rayclusters
    NAME               STATUS    AGE
    example-cluster    Running   19s
    example-cluster2   Running   19s

    $ kubectl -n ray get pods
    NAME                                READY   STATUS    RESTARTS   AGE
    example-cluster-ray-head-th4wv      1/1     Running   0          10m
    example-cluster-ray-worker-q9pjn    1/1     Running   0          10m
    example-cluster-ray-worker-qltnp    1/1     Running   0          10m
    example-cluster2-ray-head-kj5mg     1/1     Running   0          10s
    example-cluster2-ray-worker-qsgnd   1/1     Running   0          1s
    ray-operator-pod                    1/1     Running   0          10m

Now we can :ref:`run Ray programs<ray-k8s-run>` on our Ray clusters.

.. _operator-logs:

Monitoring
----------

Autoscaling logs are written to the operator pod's ``stdout`` and can be accessed with :code:`kubectl logs`.
Each line of output is prefixed by the name of the cluster followed by a colon.
The following command gets the last hundred lines of autoscaling logs for our second cluster.

.. code-block:: shell

    $ kubectl -n ray logs ray-operator-pod | grep ^example-cluster2: | tail -n 100

The output should include monitoring updates that look like this:

.. code-block:: shell

    example-cluster2:2020-12-12 13:55:36,814 DEBUG autoscaler.py:693 -- Cluster status: 1 nodes
    example-cluster2: - MostDelayedHeartbeats: {'172.17.0.4': 0.04093289375305176, '172.17.0.5': 0.04084634780883789}
    example-cluster2: - NodeIdleSeconds: Min=36 Mean=38 Max=41
    example-cluster2: - ResourceUsage: 0.0/2.0 CPU, 0.0/1.0 Custom1, 0.0/1.0 is_spot, 0.0 GiB/0.58 GiB memory, 0.0 GiB/0.1 GiB object_store_memory
    example-cluster2: - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0
    example-cluster2:Worker node types:
    example-cluster2: - worker-nodes: 1
    example-cluster2:2020-12-12 13:55:36,870 INFO resource_demand_scheduler.py:148 -- Cluster resources: [{'object_store_memory': 1.0, 'node:172.17.0.4': 1.0, 'memory': 5.0, 'CPU': 1.0}, {'object_store_memory': 1.0, 'is_spot': 1.0, 'memory': 6.0, 'node:172.17.0.5': 1.0, 'Custom1': 1.0, 'CPU': 1.0}]
    example-cluster2:2020-12-12 13:55:36,870 INFO resource_demand_scheduler.py:149 -- Node counts: defaultdict(<class 'int'>, {'head-node': 1, 'worker-nodes': 1})
    example-cluster2:2020-12-12 13:55:36,870 INFO resource_demand_scheduler.py:159 -- Placement group demands: []
    example-cluster2:2020-12-12 13:55:36,870 INFO resource_demand_scheduler.py:186 -- Resource demands: []
    example-cluster2:2020-12-12 13:55:36,870 INFO resource_demand_scheduler.py:187 -- Unfulfilled demands: []
    example-cluster2:2020-12-12 13:55:36,891 INFO resource_demand_scheduler.py:209 -- Node requests: {}
    example-cluster2:2020-12-12 13:55:36,903 DEBUG autoscaler.py:654 -- example-cluster2-ray-worker-tdxdr is not being updated and passes config check (can_update=True).
    example-cluster2:2020-12-12 13:55:36,923 DEBUG autoscaler.py:654 -- example-cluster2-ray-worker-tdxdr is not being updated and passes config check (can_update=True).

Cleaning Up
-----------

We shut down a Ray cluster by deleting the associated RayCluster resource.
Either of the next two commands will delete our second cluster, ``example-cluster2``.

.. code-block:: shell

    $ kubectl -n ray delete raycluster example-cluster2
    # OR
    $ kubectl -n ray delete -f ray/python/ray/autoscaler/kubernetes/operator_configs/example_cluster2.yaml

The pods associated with ``example-cluster2`` enter the ``Terminating`` state. In a few moments, we check that these pods are gone:

.. code-block:: shell

    $ kubectl -n ray get pods
    NAME                               READY   STATUS    RESTARTS   AGE
    example-cluster-ray-head-th4wv     1/1     Running   0          57m
    example-cluster-ray-worker-q9pjn   1/1     Running   0          56m
    example-cluster-ray-worker-qltnp   1/1     Running   0          56m
    ray-operator-pod                   1/1     Running   0          57m

Only the operator pod and the first cluster, ``example-cluster``, remain.
To finish cleaning up, we delete the cluster ``example-cluster`` and then the operator's resources.

.. code-block:: shell

    $ kubectl -n ray delete raycluster example-cluster
    $ kubectl -n ray delete -f ray/python/ray/autoscaler/kubernetes/operator_configs/operator.yaml

If you like, you can also delete the RayCluster custom resource definition.
(Using the operator again will then require reapplying the CRD.)

.. code-block:: shell

    $ kubectl delete crd rayclusters.cluster.ray.io
    # OR
    $ kubectl delete -f ray/python/ray/autoscaler/kubernetes/operator_configs/cluster_crd.yaml

.. _ray-k8s-interact:

Interacting with a Ray Cluster
==============================

:ref:`Ray Client <ray-client>` allows you to connect to your Ray cluster on Kubernetes and execute Ray programs.
The Ray Client server runs on the Ray head node, by default on port 10001.

:ref:`Ray Dashboard <ray-dashboard>` gives visibility into the state of your cluster.
By default, the dashboard uses port 8265 on the Ray head node.

.. _k8s-service:

Configuring a head node service
-------------------------------

To use Ray Client and Ray Dashboard,
you can connect via a `Kubernetes Service`_ targeting the relevant ports on the head node:

.. _svc-example:

.. code-block:: yaml

    apiVersion: v1
    kind: Service
    metadata:
      name: example-cluster-ray-head
    spec:
      # This selector must match the head node pod's selector.
      selector:
        component: example-cluster-ray-head
      ports:
        - name: client
          protocol: TCP
          port: 10001
          targetPort: 10001
        - name: dashboard
          protocol: TCP
          port: 8265
          targetPort: 8265

The head node pod's ``metadata`` should have a ``label`` matching the service's ``selector`` field:

.. code-block:: yaml

    apiVersion: v1
    kind: Pod
    metadata:
      # Automatically generates a name for the pod with this prefix.
      generateName: example-cluster-ray-head-
      # Must match the head node service selector above if a head node
      # service is required.
      labels:
        component: example-cluster-ray-head

- The Ray Kubernetes Operator automatically configures a default service exposing ports 10001 and 8265
  on the head node pod. The Operator also adds the relevant label to the head node pod's configuration.
  If this default service does not suit your use case, you can modify the service or create a new one,
  for example by using the tools ``kubectl edit``, ``kubectl create``, or ``kubectl apply``.
- The Ray Cluster Launcher does not automatically configure a service targeting the head node. A
  head node service can be specified in the cluster launching config's ``provider.services`` field. The example cluster launching
  config `example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/kubernetes/example-full.yaml>`__ includes
  the :ref:`above <svc-example>` service configuration as an example.

After launching a Ray cluster with either the Operator or Cluster Launcher, you can view the configured service:

.. code-block:: shell

    $ kubectl -n ray get services
    NAME                       TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)              AGE
    example-cluster-ray-head   ClusterIP   10.106.123.159   <none>        10001/TCP,8265/TCP   52s

.. _ray-k8s-run:

Running Ray Programs
--------------------

Given a running Ray cluster and a :ref:`Service <k8s-service>` exposing the Ray Client server's port on the head pod,
we can now run Ray programs on our cluster.
In the following examples, we assume that we have a running Ray cluster with one head node and
two worker nodes. This can be achieved in one of two ways:

- Using the :ref:`Operator <k8s-operator>` with the example resource `ray/python/ray/autoscaler/kubernetes/operator_configs/example_cluster.yaml <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/kubernetes/operator_configs/example_cluster.yaml>`__.
- Using the :ref:`Cluster Launcher <k8s-cluster-launcher>`. Modify the example file `ray/python/ray/autoscaler/kubernetes/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/kubernetes/example-full.yaml>`__
  by setting the field ``available_node_types.worker_node.min_workers``
  to 2 and then run ``ray up`` with the modified config.

Using Ray Client to connect from within the Kubernetes cluster
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can connect to your Ray cluster from another pod in the same Kubernetes cluster.
For example, you can submit a Ray application to run on the Kubernetes cluster as a `Kubernetes Job`_.
The Job runs a single pod that executes the Ray driver program to completion; when the driver exits, the pod terminates, but you can still access its logs.
The following command submits a Job which executes an `example Ray program`_.

.. code-block:: shell

    $ kubectl create -f ray/doc/kubernetes/job-example.yaml

The program executed by the Job waits for three Ray nodes to connect and then tests object transfer
between the nodes. Note that the program uses the environment variables
``EXAMPLE_CLUSTER_RAY_HEAD_SERVICE_HOST`` and ``EXAMPLE_CLUSTER_RAY_HEAD_SERVICE_PORT_CLIENT``
to access Ray Client. These `environment variables`_ are set by Kubernetes based on
the service we are using to expose the Ray head node.
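
The driver pattern is straightforward; below is a condensed sketch (an approximation for illustration, not the actual ``job_example.py``) of how a Job's driver can read these environment variables and connect through Ray Client:

.. code-block:: python

    # Condensed sketch of a Ray Client driver suitable for running as a Kubernetes Job.
    # The actual example is doc/kubernetes/example_scripts/job_example.py, which also
    # waits for all three Ray nodes to join before running its checks.
    import os

    import ray
    import ray.util

    # Kubernetes sets these variables based on the head node service.
    host = os.environ["EXAMPLE_CLUSTER_RAY_HEAD_SERVICE_HOST"]
    port = os.environ["EXAMPLE_CLUSTER_RAY_HEAD_SERVICE_PORT_CLIENT"]
    ray.util.connect(f"{host}:{port}")

    @ray.remote
    def echo(x):
        # Receives the value behind the object reference and sends it back.
        return x

    ref = ray.put("hello from the Ray cluster")
    print(ray.get(echo.remote(ref)))
    print("Success!")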

To view the output of the Job, first find the name of the pod that ran it,
then fetch its logs:

.. code-block:: shell

    $ kubectl -n ray get pods
    NAME                               READY   STATUS    RESTARTS   AGE
    example-cluster-ray-head-rpqfb     1/1     Running   0          11m
    example-cluster-ray-worker-4c7cn   1/1     Running   0          11m
    example-cluster-ray-worker-zvglb   1/1     Running   0          11m
    ray-test-job-8x2pm-77lb5           1/1     Running   0          8s

    # Fetch the logs. You should see repeated output for 10 iterations and then
    # 'Success!'
    $ kubectl -n ray logs ray-test-job-8x2pm-77lb5

To clean up the resources created by the Job after checking its output, run
the following:

.. code-block:: shell

    # List Jobs run in the Ray namespace.
    $ kubectl -n ray get jobs
    NAME                 COMPLETIONS   DURATION   AGE
    ray-test-job-kw5gn   1/1           10s        30s

    # Delete the finished Job.
    $ kubectl -n ray delete job ray-test-job-kw5gn

    # Verify that the Job's pod was cleaned up.
    $ kubectl -n ray get pods
    NAME                               READY   STATUS    RESTARTS   AGE
    example-cluster-ray-head-rpqfb     1/1     Running   0          11m
    example-cluster-ray-worker-4c7cn   1/1     Running   0          11m
    example-cluster-ray-worker-zvglb   1/1     Running   0          11m

.. _`environment variables`: https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables
.. _`example Ray program`: https://github.com/ray-project/ray/blob/master/doc/kubernetes/example_scripts/job_example.py

Using Ray Client to connect from outside the Kubernetes cluster
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To connect to the Ray cluster from outside your Kubernetes cluster,
the head node Service needs to communicate with the outside world.
One way to achieve this is by port-forwarding. Run the following command locally:

.. code-block:: shell

    $ kubectl -n ray port-forward service/example-cluster-ray-head 10001:10001

Alternatively, you can find the head node pod and connect to it directly with
the following command:

.. code-block:: shell

    # Substitute the name of your Ray cluster if using a name other than "example-cluster".
    $ kubectl -n ray port-forward \
      $(kubectl -n ray get pods -l ray-cluster-name=example-cluster -l ray-node-type=head -o custom-columns=:metadata.name) 10001:10001

Then open a new shell and try out a sample program:

.. code-block:: shell

    $ python ray/doc/kubernetes/example_scripts/run_local_example.py

The program in this example uses ``ray.util.connect("127.0.0.1:10001")`` to connect to the Ray cluster.
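
With the port-forward running, a minimal local connection test (a short sketch, not the full ``run_local_example.py``) looks like this:

.. code-block:: python

    # Minimal local connection test (sketch). Requires the port-forward above to be
    # running and matching Python/Ray versions locally and on the cluster (see the note below).
    import ray
    import ray.util

    ray.util.connect("127.0.0.1:10001")

    @ray.remote
    def square(x):
        return x * x

    print(ray.get([square.remote(i) for i in range(5)]))  # [0, 1, 4, 9, 16]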

.. note::

    Connecting with Ray Client requires matching minor versions of Python (for example 3.7)
    on the server and client ends -- that is, on the Ray head node and in the environment where
    ``ray.util.connect`` is invoked. Note that the default ``rayproject/ray`` images use Python 3.7.
    Nightly builds are now available for Python 3.6 and 3.8 at the `Ray Docker Hub <https://hub.docker.com/r/rayproject/ray/tags?page=1&ordering=last_updated&name=nightly-py>`_.

    Connecting with Ray Client currently also requires matching Ray versions. In particular, to connect from a local machine to a cluster running the examples in this document, the :ref:`nightly <install-nightlies>` version of Ray must be installed locally.

Running the program on the head node
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is also possible to execute a Ray program on the Ray head node.
(Replace the pod name with the name of your head pod
- you can find it by running ``kubectl -n ray get pods``.)

.. code-block:: shell

    # Copy the test script onto the head node.
    $ kubectl -n ray cp ray/doc/kubernetes/example_scripts/run_on_head.py example-cluster-ray-head-p9mfh:/home/ray

    # Run the example program on the head node.
    $ kubectl -n ray exec example-cluster-ray-head-p9mfh -- python /home/ray/run_on_head.py
    # You should see repeated output for 10 iterations and then 'Success!'

Alternatively, you can run tasks interactively on the cluster by connecting a remote
shell to one of the pods.

.. code-block:: shell

    # Get a remote shell to the head node.
    $ kubectl -n ray exec -it example-cluster-ray-head-5455bb66c9-7l6xj -- bash

    # Run the example program on the head node.
    root@ray-head-6f566446c-5rdmb:/# python /home/ray/run_on_head.py
    # You should see repeated output for 10 iterations and then 'Success!'

The program in this example uses ``ray.init(address="auto")`` to connect to the Ray cluster.

Accessing the Dashboard
-----------------------

The Ray Dashboard can be accessed locally using ``kubectl port-forward``.

.. code-block:: shell

    $ kubectl -n ray port-forward service/example-cluster-ray-head 8265:8265

After running the above command locally, the Dashboard will be accessible at ``http://localhost:8265``.

You can also monitor the state of the cluster with ``kubectl logs`` when using the :ref:`Operator <operator-logs>` or with ``ray monitor`` when using
the :ref:`Ray Cluster Launcher <cluster-launcher-commands>`.

.. warning::

    The Dashboard currently shows resource limits of the physical host each Ray node is running on,
    rather than the limits of the container the node is running in.
    This is a known bug tracked `here <https://github.com/ray-project/ray/issues/11172>`_.

.. _k8s-comparison:

Cluster Launcher vs Operator
============================

We compare the Ray Cluster Launcher and Ray Kubernetes Operator as methods of managing an autoscaling Ray cluster.

Comparison of use cases
-----------------------

- The Cluster Launcher is convenient for development and experimentation. Using the Cluster Launcher requires a local installation of Ray. The Ray CLI then provides a convenient interface for interacting with a Ray cluster.
- The Operator is geared towards production use cases. It does not require installing Ray locally - all interactions with your Ray cluster are mediated by Kubernetes.

Comparison of architectures
---------------------------

- With the Cluster Launcher, the user launches a Ray cluster from their local environment by invoking ``ray up``. This provisions a pod for the Ray head node, which then runs the `autoscaling process <https://github.com/ray-project/ray/blob/master/python/ray/monitor.py>`__.
- The `Operator <https://github.com/ray-project/ray/blob/master/python/ray/ray_operator/operator.py>`__ centralizes cluster launching and autoscaling in the `Operator pod <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/kubernetes/operator_configs/operator.yaml>`__.
  The user creates a `Kubernetes Custom Resource`_ describing the intended state of the Ray cluster.
  The Operator then detects the resource, launches a Ray cluster, and runs the autoscaling process in the operator pod.
  The Operator can manage multiple Ray clusters by running an autoscaling process for each Ray cluster.

Comparison of configuration options
-----------------------------------

The configuration options for the two methods are completely analogous - compare sample configurations for the `Cluster Launcher <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/kubernetes/example-full.yaml>`__
and for the `Operator <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/kubernetes/operator_configs/example_cluster.yaml>`__.
With a few exceptions, the fields of the RayCluster resource managed by the Operator are camelCase versions of the corresponding snake_case Cluster Launcher fields.
In fact, the Operator `internally <https://github.com/ray-project/ray/blob/master/python/ray/ray_operator/operator_utils.py>`__ converts
RayCluster resources to Cluster Launcher configs. (A toy illustration of the naming correspondence follows the list below.)

A summary of the configuration differences:

- The Cluster Launcher field ``available_node_types`` for specifying the types of pods available for autoscaling is renamed to ``podTypes`` in the Operator's RayCluster configuration.
- The Cluster Launcher field ``resources`` for specifying custom Ray resources provided by a node type is renamed to ``rayResources`` in the Operator's RayCluster configuration.
- The ``provider`` field in the Cluster Launcher config has no analogue in the Operator's RayCluster configuration. (The Operator fills this field internally.)
- The Ray start commands differ between the two tools:

  * When using the Cluster Launcher, ``head_ray_start_commands`` should include the argument ``--autoscaling-config=~/ray_bootstrap_config.yaml``; this is important for the configuration of the head node's autoscaler.
  * On the other hand, the Operator's ``headRayStartCommands`` should include a ``--no-monitor`` flag to prevent the autoscaling/monitoring process from running on the head node.
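
To make the camelCase/snake_case correspondence concrete, here is a toy illustration (not the Operator's actual conversion code, which lives in ``operator_utils.py``):

.. code-block:: python

    # Toy illustration of the camelCase <-> snake_case field correspondence.
    # The Operator's real conversion logic also handles the exceptions listed
    # above (e.g. podTypes vs. available_node_types).
    import re

    def camel_to_snake(name: str) -> str:
        return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

    print(camel_to_snake("headRayStartCommands"))  # head_ray_start_commands
    print(camel_to_snake("maxWorkers"))            # max_workers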

Questions or Issues?
--------------------

.. include:: /_help.rst

.. _`Kubernetes Job`: https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/
.. _`Kubernetes Service`: https://kubernetes.io/docs/concepts/services-networking/service/
.. _`Kubernetes Operator`: https://kubernetes.io/docs/concepts/extend-kubernetes/operator/
.. _`Kubernetes Custom Resource`: https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/
.. _`Kubernetes Custom Resource Definition`: https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/
.. _`annotation`: https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations/#attaching-metadata-to-objects
.. _`permissions`: https://kubernetes.io/docs/reference/access-authn-authz/rbac/
.. _`minikube`: https://minikube.sigs.k8s.io/docs/start/
.. _`namespaced`: https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/