:orphan:

.. _k8s-gpus:

GPU Usage with Kubernetes
=========================

This document provides some notes on GPU usage with Kubernetes.


To use GPUs on Kubernetes, you need to both configure your Kubernetes setup and add values to your Ray cluster configuration.

For GPU documentation specific to each cloud, see the instructions for `GKE`_, `EKS`_, and `AKS`_.

The `Ray Docker Hub <https://hub.docker.com/r/rayproject/>`_ hosts CUDA-based images packaged with Ray for use in Kubernetes pods.
For example, the image ``rayproject/ray-ml:nightly-gpu`` is ideal for running GPU-based ML workloads with the most recent nightly build of Ray.
See :ref:`here<docker-images>` for further details on Ray images.


Using Nvidia GPUs requires specifying the relevant resource ``limits`` in the container fields of your Kubernetes configurations.
(Kubernetes `sets <https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#using-device-plugins>`_
the GPU request equal to the limit.) The configuration for a pod running a Ray GPU image and
using one Nvidia GPU looks like this:


.. code-block:: yaml

    apiVersion: v1
    kind: Pod
    metadata:
      generateName: example-cluster-ray-worker
    spec:
      ...
      containers:
        - name: ray-node
          image: rayproject/ray:nightly-gpu
          ...
          resources:
            requests:
              cpu: 1000m
              memory: 512Mi
            limits:
              memory: 512Mi
              nvidia.com/gpu: 1

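
The request-equals-limit rule mentioned above can be modeled in a few lines of Python.
This is only an illustrative sketch, not Kubernetes code: ``effective_gpu_request`` is a hypothetical helper,
and it assumes GPU counts are whole numbers listed under the ``nvidia.com/gpu`` resource name.

.. code-block:: python

    GPU_RESOURCE = "nvidia.com/gpu"


    def effective_gpu_request(resources: dict) -> int:
        """Return the GPU request implied by a container's ``resources`` mapping.

        Hypothetical helper that mirrors the documented rules:
        a GPU limit without a request implies request == limit,
        a request given alongside a limit must equal it,
        and a GPU request without a limit is rejected.
        """
        requests = resources.get("requests", {})
        limits = resources.get("limits", {})
        limit = limits.get(GPU_RESOURCE)
        request = requests.get(GPU_RESOURCE)

        if limit is None:
            if request is not None:
                raise ValueError("a GPU request requires a matching GPU limit")
            return 0  # the container does not use GPUs
        if request is not None and int(request) != int(limit):
            raise ValueError("GPU request and limit must be equal")
        return int(limit)


    # The resources section of the pod configuration shown above.
    resources = {
        "requests": {"cpu": "1000m", "memory": "512Mi"},
        "limits": {"memory": "512Mi", "nvidia.com/gpu": 1},
    }
    gpu_request = effective_gpu_request(resources)

For the configuration above, ``gpu_request`` is ``1`` even though ``requests`` never mentions a GPU, which is why only the limit needs to be written out.
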

GPU taints and tolerations
--------------------------

.. note::

  If you use a managed Kubernetes service, you probably don't need to worry about this section.


The `Nvidia gpu plugin`_ for Kubernetes applies `taints`_ to GPU nodes; these taints prevent non-GPU pods from being scheduled on GPU nodes.
Managed Kubernetes services like GKE, EKS, and AKS automatically apply matching `tolerations`_
to pods requesting GPU resources. Tolerations are applied by means of Kubernetes's `ExtendedResourceToleration`_ `admission controller`_.
If this admission controller is not enabled for your Kubernetes cluster, you may need to manually add a GPU toleration to each of your GPU pod configurations. For example:


.. code-block:: yaml

    apiVersion: v1
    kind: Pod
    metadata:
      generateName: example-cluster-ray-worker
    spec:
      ...
      tolerations:
        - effect: NoSchedule
          key: nvidia.com/gpu
          operator: Exists
      ...
      containers:
        - name: ray-node
          image: rayproject/ray:nightly-gpu
          ...

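
If you generate pod configurations programmatically, the same toleration can be added in code.
The following is a minimal sketch using only the Python standard library; ``with_gpu_toleration`` is a hypothetical helper,
and the pod dictionary is an assumed, trimmed-down version of the configuration above.

.. code-block:: python

    import copy
    import json

    # Same toleration as in the YAML example above.
    GPU_TOLERATION = {
        "effect": "NoSchedule",
        "key": "nvidia.com/gpu",
        "operator": "Exists",
    }


    def with_gpu_toleration(pod: dict) -> dict:
        """Return a copy of ``pod`` whose spec tolerates the GPU taint.

        Hypothetical helper: the input is left untouched, and the
        toleration is not added again if it is already present.
        """
        patched = copy.deepcopy(pod)
        tolerations = patched.setdefault("spec", {}).setdefault("tolerations", [])
        if GPU_TOLERATION not in tolerations:
            tolerations.append(dict(GPU_TOLERATION))
        return patched


    # Assumed minimal pod configuration, for illustration only.
    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"generateName": "example-cluster-ray-worker"},
        "spec": {
            "containers": [
                {
                    "name": "ray-node",
                    "image": "rayproject/ray:nightly-gpu",
                    "resources": {"limits": {"nvidia.com/gpu": 1}},
                }
            ]
        },
    }

    patched = with_gpu_toleration(pod)
    # Kubernetes accepts JSON manifests as well as YAML.
    print(json.dumps(patched, indent=2))

Returning a patched copy rather than mutating the input keeps the original configuration reusable for pods that should not be scheduled on GPU nodes.
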

Further reference and discussion
--------------------------------

Read about Kubernetes device plugins `here <https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/>`__,
about Kubernetes GPU plugins `here <https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus>`__,
and about Nvidia's GPU plugin for Kubernetes `here <https://github.com/NVIDIA/k8s-device-plugin>`__.

If you run into problems setting up GPUs for your Ray cluster on Kubernetes, please reach out to us at `<https://discuss.ray.io>`_.


Questions or Issues?
--------------------

.. include:: /_help.rst


.. _`GKE`: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus
.. _`EKS`: https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami.html
.. _`AKS`: https://docs.microsoft.com/en-us/azure/aks/gpu-cluster

.. _`tolerations`: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
.. _`taints`: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
.. _`Nvidia gpu plugin`: https://github.com/NVIDIA/k8s-device-plugin
.. _`admission controller`: https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/
.. _`ExtendedResourceToleration`: https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#extendedresourcetoleration