  1. .. _cluster-config:
  2. Cluster YAML Configuration Options
  3. ==================================
The cluster configuration is defined in a YAML file that the Cluster Launcher uses to launch the head node and the Autoscaler uses to launch worker nodes. Once the cluster configuration is defined, use the :ref:`Ray CLI <ray-cli>` to perform operations such as starting and stopping the cluster.
  5. Syntax
  6. ------
  7. .. parsed-literal::
  8. :ref:`cluster_name <cluster-configuration-cluster-name>`: str
  9. :ref:`max_workers <cluster-configuration-max-workers>`: int
  10. :ref:`upscaling_speed <cluster-configuration-upscaling-speed>`: float
  11. :ref:`idle_timeout_minutes <cluster-configuration-idle-timeout-minutes>`: int
  12. :ref:`docker <cluster-configuration-docker>`:
  13. :ref:`docker <cluster-configuration-docker-type>`
  14. :ref:`provider <cluster-configuration-provider>`:
  15. :ref:`provider <cluster-configuration-provider-type>`
  16. :ref:`auth <cluster-configuration-auth>`:
  17. :ref:`auth <cluster-configuration-auth-type>`
  18. :ref:`available_node_types <cluster-configuration-available-node-types>`:
  19. :ref:`node_types <cluster-configuration-node-types-type>`
  20. :ref:`worker_nodes <cluster-configuration-worker-nodes>`:
  21. :ref:`node_config <cluster-configuration-node-config-type>`
  22. :ref:`head_node_type <cluster-configuration-head-node-type>`: str
  23. :ref:`file_mounts <cluster-configuration-file-mounts>`:
  24. :ref:`file_mounts <cluster-configuration-file-mounts-type>`
  25. :ref:`cluster_synced_files <cluster-configuration-cluster-synced-files>`:
  26. - str
  27. :ref:`rsync_exclude <cluster-configuration-rsync-exclude>`:
  28. - str
  29. :ref:`rsync_filter <cluster-configuration-rsync-filter>`:
  30. - str
  31. :ref:`initialization_commands <cluster-configuration-initialization-commands>`:
  32. - str
  33. :ref:`setup_commands <cluster-configuration-setup-commands>`:
  34. - str
  35. :ref:`head_setup_commands <cluster-configuration-head-setup-commands>`:
  36. - str
  37. :ref:`worker_setup_commands <cluster-configuration-worker-setup-commands>`:
  38. - str
  39. :ref:`head_start_ray_commands <cluster-configuration-head-start-ray-commands>`:
  40. - str
  41. :ref:`worker_start_ray_commands <cluster-configuration-worker-start-ray-commands>`:
  42. - str
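As a rough illustration of how the top-level keys above fit together, a fragment of an AWS configuration might look like the following sketch. The cluster name and SSH user are illustrative assumptions, not recommended values; the numeric values are the defaults listed under Properties and Definitions.

.. code-block:: yaml

    # Illustrative fragment only; adjust for your account and workload.
    # available_node_types and head_node_type are omitted here.
    cluster_name: example_cluster   # hypothetical name
    max_workers: 2
    upscaling_speed: 1.0
    idle_timeout_minutes: 5
    provider:
        type: aws
        region: us-west-2
    auth:
        ssh_user: ubuntu            # assumed login user for the chosen image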
  43. Custom types
  44. ------------
  45. .. _cluster-configuration-docker-type:
  46. Docker
  47. ~~~~~~
  48. .. parsed-literal::
  49. :ref:`image <cluster-configuration-image>`: str
  50. :ref:`head_image <cluster-configuration-head-image>`: str
  51. :ref:`worker_image <cluster-configuration-worker-image>`: str
  52. :ref:`container_name <cluster-configuration-container-name>`: str
  53. :ref:`pull_before_run <cluster-configuration-pull-before-run>`: bool
  54. :ref:`run_options <cluster-configuration-run-options>`:
  55. - str
  56. :ref:`head_run_options <cluster-configuration-head-run-options>`:
  57. - str
  58. :ref:`worker_run_options <cluster-configuration-worker-run-options>`:
  59. - str
  60. :ref:`disable_automatic_runtime_detection <cluster-configuration-disable-automatic-runtime-detection>`: bool
  61. :ref:`disable_shm_size_detection <cluster-configuration-disable-shm-size-detection>`: bool
  62. .. _cluster-configuration-auth-type:
  63. Auth
  64. ~~~~
  65. .. tabs::
  66. .. group-tab:: AWS
  67. .. parsed-literal::
  68. :ref:`ssh_user <cluster-configuration-ssh-user>`: str
  69. :ref:`ssh_private_key <cluster-configuration-ssh-private-key>`: str
  70. .. group-tab:: Azure
  71. .. parsed-literal::
  72. :ref:`ssh_user <cluster-configuration-ssh-user>`: str
  73. :ref:`ssh_private_key <cluster-configuration-ssh-private-key>`: str
  74. :ref:`ssh_public_key <cluster-configuration-ssh-public-key>`: str
  75. .. group-tab:: GCP
  76. .. parsed-literal::
  77. :ref:`ssh_user <cluster-configuration-ssh-user>`: str
  78. :ref:`ssh_private_key <cluster-configuration-ssh-private-key>`: str
  79. .. _cluster-configuration-provider-type:
  80. Provider
  81. ~~~~~~~~
  82. .. tabs::
  83. .. group-tab:: AWS
  84. .. parsed-literal::
  85. :ref:`type <cluster-configuration-type>`: str
  86. :ref:`region <cluster-configuration-region>`: str
  87. :ref:`availability_zone <cluster-configuration-availability-zone>`: str
  88. :ref:`cache_stopped_nodes <cluster-configuration-cache-stopped-nodes>`: bool
  89. .. group-tab:: Azure
  90. .. parsed-literal::
  91. :ref:`type <cluster-configuration-type>`: str
  92. :ref:`location <cluster-configuration-location>`: str
  93. :ref:`resource_group <cluster-configuration-resource-group>`: str
  94. :ref:`subscription_id <cluster-configuration-subscription-id>`: str
  95. :ref:`cache_stopped_nodes <cluster-configuration-cache-stopped-nodes>`: bool
  96. .. group-tab:: GCP
  97. .. parsed-literal::
  98. :ref:`type <cluster-configuration-type>`: str
  99. :ref:`region <cluster-configuration-region>`: str
  100. :ref:`availability_zone <cluster-configuration-availability-zone>`: str
  101. :ref:`project_id <cluster-configuration-project-id>`: str
  102. :ref:`cache_stopped_nodes <cluster-configuration-cache-stopped-nodes>`: bool
  103. .. _cluster-configuration-node-types-type:
  104. Node types
  105. ~~~~~~~~~~
The keys of the node types object are the names of the different node types.
  107. .. parsed-literal::
  108. <node_type_1_name>:
  109. :ref:`node_config <cluster-configuration-node-config>`:
  110. :ref:`Node config <cluster-configuration-node-config-type>`
  111. :ref:`resources <cluster-configuration-resources>`:
  112. :ref:`Resources <cluster-configuration-resources-type>`
  113. :ref:`min_workers <cluster-configuration-node-min-workers>`: int
  114. :ref:`max_workers <cluster-configuration-node-max-workers>`: int
  115. :ref:`worker_setup_commands <cluster-configuration-node-type-worker-setup-commands>`:
  116. - str
  117. :ref:`docker <cluster-configuration-node-docker>`:
  118. :ref:`Node Docker <cluster-configuration-node-docker-type>`
  119. <node_type_2_name>:
  120. ...
  121. ...
  122. .. _cluster-configuration-node-config-type:
  123. Node config
  124. ~~~~~~~~~~~
  125. .. tabs::
  126. .. group-tab:: AWS
  127. A YAML object which conforms to the EC2 ``create_instances`` API in `the AWS docs <https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances>`_.
  128. .. group-tab:: Azure
  129. A YAML object as defined in `the deployment template <https://docs.microsoft.com/en-us/azure/templates/microsoft.compute/virtualmachines>`_ whose resources are defined in `the Azure docs <https://docs.microsoft.com/en-us/azure/templates/>`_.
  130. .. group-tab:: GCP
  131. A YAML object as defined in `the GCP docs <https://cloud.google.com/compute/docs/reference/rest/v1/instances>`_.
  132. .. _cluster-configuration-node-docker-type:
  133. Node Docker
  134. ~~~~~~~~~~~
  135. .. parsed-literal::
  136. :ref:`image <cluster-configuration-image>`: str
  137. :ref:`pull_before_run <cluster-configuration-pull-before-run>`: bool
  138. :ref:`worker_run_options <cluster-configuration-worker-run-options>`:
  139. - str
  140. :ref:`disable_automatic_runtime_detection <cluster-configuration-disable-automatic-runtime-detection>`: bool
  141. :ref:`disable_shm_size_detection <cluster-configuration-disable-shm-size-detection>`: bool
  142. .. _cluster-configuration-resources-type:
  143. Resources
  144. ~~~~~~~~~
  145. .. parsed-literal::
  146. :ref:`CPU <cluster-configuration-CPU>`: int
  147. :ref:`GPU <cluster-configuration-GPU>`: int
  148. <custom_resource1>: int
  149. <custom_resource2>: int
  150. ...
  151. .. _cluster-configuration-file-mounts-type:
  152. File mounts
  153. ~~~~~~~~~~~
  154. .. parsed-literal::
  155. <path1_on_remote_machine>: str # Path 1 on local machine
  156. <path2_on_remote_machine>: str # Path 2 on local machine
  157. ...
  158. Properties and Definitions
  159. --------------------------
  160. .. _cluster-configuration-cluster-name:
  161. ``cluster_name``
  162. ~~~~~~~~~~~~~~~~
  163. The name of the cluster. This is the namespace of the cluster.
  164. * **Required:** Yes
  165. * **Importance:** High
  166. * **Type:** String
  167. * **Default:** "default"
  168. * **Pattern:** ``[a-zA-Z0-9_]+``
  169. .. _cluster-configuration-max-workers:
  170. ``max_workers``
  171. ~~~~~~~~~~~~~~~
  172. The maximum number of workers the cluster will have at any given time.
  173. * **Required:** No
  174. * **Importance:** High
  175. * **Type:** Integer
  176. * **Default:** ``2``
  177. * **Minimum:** ``0``
  178. * **Maximum:** Unbounded
  179. .. _cluster-configuration-upscaling-speed:
  180. ``upscaling_speed``
  181. ~~~~~~~~~~~~~~~~~~~
  182. The number of nodes allowed to be pending as a multiple of the current number of nodes. For example, if set to 1.0, the cluster can grow in size by at most 100% at any time, so if the cluster currently has 20 nodes, at most 20 pending launches are allowed.
  183. * **Required:** No
  184. * **Importance:** Medium
  185. * **Type:** Float
  186. * **Default:** ``1.0``
  187. * **Minimum:** ``0.0``
  188. * **Maximum:** Unbounded
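To make the arithmetic concrete, here is a small sketch with an assumed value of ``2.0``. The node count in the comment is illustrative and follows directly from the definition above.

.. code-block:: yaml

    # With 10 nodes currently in the cluster, at most 2.0 * 10 = 20
    # node launches may be pending at once.
    upscaling_speed: 2.0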
  189. .. _cluster-configuration-idle-timeout-minutes:
  190. ``idle_timeout_minutes``
  191. ~~~~~~~~~~~~~~~~~~~~~~~~
  192. The number of minutes that need to pass before an idle worker node is removed by the Autoscaler.
  193. * **Required:** No
  194. * **Importance:** Medium
  195. * **Type:** Integer
  196. * **Default:** ``5``
  197. * **Minimum:** ``0``
  198. * **Maximum:** Unbounded
  199. .. _cluster-configuration-docker:
  200. ``docker``
  201. ~~~~~~~~~~
  202. Configure Ray to run in Docker containers.
  203. * **Required:** No
  204. * **Importance:** High
  205. * **Type:** :ref:`Docker <cluster-configuration-docker-type>`
  206. * **Default:** ``{}``
  207. In rare cases when Docker is not available on the system by default (e.g., bad AMI), add the following commands to :ref:`initialization_commands <cluster-configuration-initialization-commands>` to install it.
  208. .. code-block:: yaml
  209. initialization_commands:
  210. - curl -fsSL https://get.docker.com -o get-docker.sh
  211. - sudo sh get-docker.sh
  212. - sudo usermod -aG docker $USER
  213. - sudo systemctl restart docker -f
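For reference, a ``docker`` section might look like the sketch below. The image is one of those published by the Ray project and the container name is the documented default; the ``--ulimit`` entry is only an illustrative ``docker run`` flag, not a required setting.

.. code-block:: yaml

    docker:
        image: rayproject/ray-ml:latest-gpu
        container_name: ray_container
        pull_before_run: True
        run_options:
            # Example of an extra flag passed through to `docker run`.
            - --ulimit nofile=65536:65536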
  214. .. _cluster-configuration-provider:
  215. ``provider``
  216. ~~~~~~~~~~~~
  217. The cloud provider-specific configuration properties.
  218. * **Required:** Yes
  219. * **Importance:** High
  220. * **Type:** :ref:`Provider <cluster-configuration-provider-type>`
  221. .. _cluster-configuration-auth:
  222. ``auth``
  223. ~~~~~~~~
  224. Authentication credentials that Ray will use to launch nodes.
  225. * **Required:** Yes
  226. * **Importance:** High
  227. * **Type:** :ref:`Auth <cluster-configuration-auth-type>`
  228. .. _cluster-configuration-available-node-types:
  229. ``available_node_types``
  230. ~~~~~~~~~~~~~~~~~~~~~~~~
  231. Tells the autoscaler the allowed node types and the resources they provide.
  232. The key is the name of the node type, which is just for debugging purposes.
  233. * **Required:** No
  234. * **Importance:** High
  235. * **Type:** :ref:`Node types <cluster-configuration-node-types-type>`
  236. * **Default:**
  237. .. tabs::
  238. .. group-tab:: AWS
  239. .. code-block:: yaml
  240. available_node_types:
  241. ray.head.default:
  242. node_config:
  243. InstanceType: m5.large
  244. BlockDeviceMappings:
  245. - DeviceName: /dev/sda1
  246. Ebs:
  247. VolumeSize: 100
  248. resources: {"CPU": 2}
  249. min_workers: 0
  250. max_workers: 0
  251. ray.worker.default:
  252. node_config:
  253. InstanceType: m5.large
  254. InstanceMarketOptions:
  255. MarketType: spot
  256. resources: {"CPU": 2}
  257. min_workers: 0
  258. .. _cluster-configuration-head-node-type:
  259. ``head_node_type``
  260. ~~~~~~~~~~~~~~~~~~
  261. The key for one of the node types in :ref:`available_node_types <cluster-configuration-available-node-types>`. This node type will be used to launch the head node.
  262. * **Required:** Yes
  263. * **Importance:** High
  264. * **Type:** String
  265. * **Pattern:** ``[a-zA-Z0-9_]+``
  266. .. _cluster-configuration-worker-nodes:
  267. ``worker_nodes``
  268. ~~~~~~~~~~~~~~~~
  269. The configuration to be used to launch worker nodes on the cloud service provider. Generally, node configs are set in the :ref:`node config of each node type <cluster-configuration-node-config>`. Setting this property allows propagation of a default value to all the node types when they launch as workers (e.g., using spot instances across all workers can be configured here so that it doesn't have to be set across all instance types).
  270. * **Required:** No
  271. * **Importance:** Low
  272. * **Type:** :ref:`Node config <cluster-configuration-node-config-type>`
  273. * **Default:** ``{}``
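As a sketch of the spot-instance case mentioned above, and assuming the AWS provider (whose node config follows the EC2 ``create_instances`` schema), the shared worker default might be written as:

.. code-block:: yaml

    # Applied as a default to every node type launched as a worker.
    worker_nodes:
        InstanceMarketOptions:
            MarketType: spot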
  274. .. _cluster-configuration-file-mounts:
  275. ``file_mounts``
  276. ~~~~~~~~~~~~~~~
  277. The files or directories to copy to the head and worker nodes.
  278. * **Required:** No
  279. * **Importance:** High
  280. * **Type:** :ref:`File mounts <cluster-configuration-file-mounts-type>`
  281. * **Default:** ``[]``
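A minimal sketch of the mapping, following the layout shown under File mounts: each key is a path on the remote machine and each value is the corresponding path on the local machine. Both paths here are hypothetical placeholders.

.. code-block:: yaml

    file_mounts:
        # <path on remote machine>: <path on local machine>
        "/home/ubuntu/project": "/local/path/to/project"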
  282. .. _cluster-configuration-cluster-synced-files:
  283. ``cluster_synced_files``
  284. ~~~~~~~~~~~~~~~~~~~~~~~~
A list of paths to the files or directories to copy from the head node to the worker nodes. Each path on the head node is copied to the same path on the worker node. This behavior is a subset of the ``file_mounts`` behavior, so in the vast majority of cases you should just use :ref:`file_mounts <cluster-configuration-file-mounts>`.
  286. * **Required:** No
  287. * **Importance:** Low
  288. * **Type:** List of String
  289. * **Default:** ``[]``
  290. .. _cluster-configuration-rsync-exclude:
  291. ``rsync_exclude``
  292. ~~~~~~~~~~~~~~~~~
  293. A list of patterns for files to exclude when running ``rsync up`` or ``rsync down``. The filter is applied on the source directory only.
  294. Example for a pattern in the list: ``**/.git/**``.
  295. * **Required:** No
  296. * **Importance:** Low
  297. * **Type:** List of String
  298. * **Default:** ``[]``
  299. .. _cluster-configuration-rsync-filter:
  300. ``rsync_filter``
  301. ~~~~~~~~~~~~~~~~
  302. A list of patterns for files to exclude when running ``rsync up`` or ``rsync down``. The filter is applied on the source directory and recursively through all subdirectories.
  303. Example for a pattern in the list: ``.gitignore``.
  304. * **Required:** No
  305. * **Importance:** Low
  306. * **Type:** List of String
  307. * **Default:** ``[]``
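Using the example patterns given for the two properties, a configuration that skips Git metadata and honors ``.gitignore`` files might look like the following sketch. The glob patterns are quoted because a leading ``*`` has special meaning in YAML.

.. code-block:: yaml

    rsync_exclude:
        - "**/.git"
        - "**/.git/**"
    rsync_filter:
        - ".gitignore"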
  308. .. _cluster-configuration-initialization-commands:
  309. ``initialization_commands``
  310. ~~~~~~~~~~~~~~~~~~~~~~~~~~~
A list of commands that will be run before the :ref:`setup commands <cluster-configuration-setup-commands>`. If Docker is enabled, these commands will run outside the container and before Docker is set up.
  312. * **Required:** No
  313. * **Importance:** Medium
  314. * **Type:** List of String
  315. * **Default:** ``[]``
  316. .. _cluster-configuration-setup-commands:
  317. ``setup_commands``
  318. ~~~~~~~~~~~~~~~~~~
  319. A list of commands to run to set up nodes. These commands will always run on the head and worker nodes and will be merged with :ref:`head setup commands <cluster-configuration-head-setup-commands>` for head and with :ref:`worker setup commands <cluster-configuration-worker-setup-commands>` for workers.
  320. * **Required:** No
  321. * **Importance:** Medium
  322. * **Type:** List of String
  323. * **Default:**
  324. .. tabs::
  325. .. group-tab:: AWS
  326. .. code-block:: yaml
  327. # Default setup_commands:
  328. setup_commands:
  329. - echo 'export PATH="$HOME/anaconda3/envs/tensorflow_p36/bin:$PATH"' >> ~/.bashrc
  330. - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp36-cp36m-manylinux2014_x86_64.whl
  331. - Setup commands should ideally be *idempotent* (i.e., can be run multiple times without changing the result); this allows Ray to safely update nodes after they have been created. You can usually make commands idempotent with small modifications, e.g. ``git clone foo`` can be rewritten as ``test -e foo || git clone foo`` which checks if the repo is already cloned first.
- Setup commands are run sequentially but separately. For example, if you are using Anaconda, you need to run ``conda activate env && pip install -U ray`` as a single command, because splitting it into two setup commands will not work.
- Ideally, avoid ``setup_commands`` altogether by creating a Docker image with all the dependencies preinstalled, which minimizes startup time.
- **Tip**: if you also want to run ``apt-get`` commands during setup, add the following list of commands:
  335. .. code-block:: yaml
  336. setup_commands:
  337. - sudo pkill -9 apt-get || true
  338. - sudo pkill -9 dpkg || true
  339. - sudo dpkg --configure -a
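The idempotency and single-command notes above can be sketched together as follows. The repository name ``foo`` and the environment name ``env`` are the placeholders used in the notes, not real names.

.. code-block:: yaml

    setup_commands:
        # Idempotent: clones only if the directory does not exist yet.
        - test -e foo || git clone foo
        # Chained with && so the install runs in the activated environment;
        # as two separate entries the activation would not carry over.
        - conda activate env && pip install -U ray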
  340. .. _cluster-configuration-head-setup-commands:
  341. ``head_setup_commands``
  342. ~~~~~~~~~~~~~~~~~~~~~~~
  343. A list of commands to run to set up the head node. These commands will be merged with the general :ref:`setup commands <cluster-configuration-setup-commands>`.
  344. * **Required:** No
  345. * **Importance:** Low
  346. * **Type:** List of String
  347. * **Default:** ``[]``
  348. .. _cluster-configuration-worker-setup-commands:
  349. ``worker_setup_commands``
  350. ~~~~~~~~~~~~~~~~~~~~~~~~~
  351. A list of commands to run to set up the worker nodes. These commands will be merged with the general :ref:`setup commands <cluster-configuration-setup-commands>`.
  352. * **Required:** No
  353. * **Importance:** Low
  354. * **Type:** List of String
  355. * **Default:** ``[]``
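A small sketch of how the three command lists relate. The head-only package is a hypothetical example of a dependency that only the head node needs.

.. code-block:: yaml

    # Run on both the head node and the worker nodes.
    setup_commands:
        - pip install -U ray
    # Merged with setup_commands on the head node only.
    head_setup_commands:
        - pip install boto3   # hypothetical head-only dependency
    # Merged with setup_commands on worker nodes only.
    worker_setup_commands: []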
  356. .. _cluster-configuration-head-start-ray-commands:
  357. ``head_start_ray_commands``
  358. ~~~~~~~~~~~~~~~~~~~~~~~~~~~
Commands to start Ray on the head node. You don't need to change this.
  360. * **Required:** No
  361. * **Importance:** Low
  362. * **Type:** List of String
  363. * **Default:**
  364. .. tabs::
  365. .. group-tab:: AWS
  366. .. code-block:: yaml
  367. head_start_ray_commands:
  368. - ray stop
  369. - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml
  370. .. _cluster-configuration-worker-start-ray-commands:
  371. ``worker_start_ray_commands``
  372. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Commands to start Ray on the worker nodes. You don't need to change this.
  374. * **Required:** No
  375. * **Importance:** Low
  376. * **Type:** List of String
  377. * **Default:**
  378. .. tabs::
  379. .. group-tab:: AWS
  380. .. code-block:: yaml
  381. worker_start_ray_commands:
  382. - ray stop
  383. - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
  384. .. _cluster-configuration-image:
  385. ``docker.image``
  386. ~~~~~~~~~~~~~~~~
The default Docker image to pull on the head and worker nodes. This can be overridden by the :ref:`head_image <cluster-configuration-head-image>` and :ref:`worker_image <cluster-configuration-worker-image>` fields. If neither ``image`` nor the pair of :ref:`head_image <cluster-configuration-head-image>` and :ref:`worker_image <cluster-configuration-worker-image>` is specified, Ray will not use Docker.
  388. * **Required:** Yes (If Docker is in use.)
  389. * **Importance:** High
  390. * **Type:** String
The Ray project provides Docker images on `DockerHub <https://hub.docker.com/u/rayproject>`_. The repository includes the following images:
  392. * ``rayproject/ray-ml:latest-gpu``: CUDA support, includes ML dependencies.
  393. * ``rayproject/ray:latest-gpu``: CUDA support, no ML dependencies.
  394. * ``rayproject/ray-ml:latest``: No CUDA support, includes ML dependencies.
  395. * ``rayproject/ray:latest``: No CUDA support, no ML dependencies.
  396. .. _cluster-configuration-head-image:
  397. ``docker.head_image``
  398. ~~~~~~~~~~~~~~~~~~~~~
  399. Docker image for the head node to override the default :ref:`docker image <cluster-configuration-image>`.
  400. * **Required:** No
  401. * **Importance:** Low
  402. * **Type:** String
  403. .. _cluster-configuration-worker-image:
  404. ``docker.worker_image``
  405. ~~~~~~~~~~~~~~~~~~~~~~~
  406. Docker image for the worker nodes to override the default :ref:`docker image <cluster-configuration-image>`.
  407. * **Required:** No
  408. * **Importance:** Low
  409. * **Type:** String
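As an illustrative sketch of the override fields, the configuration below uses a CPU-only image by default and a CUDA-enabled image on the workers. The particular pairing of images is an assumption chosen for the example.

.. code-block:: yaml

    docker:
        # Default image, used by the head node in this sketch.
        image: rayproject/ray:latest
        # Overrides `image` on worker nodes only.
        worker_image: rayproject/ray:latest-gpu
        container_name: ray_container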
  410. .. _cluster-configuration-container-name:
  411. ``docker.container_name``
  412. ~~~~~~~~~~~~~~~~~~~~~~~~~
  413. The name to use when starting the Docker container.
  414. * **Required:** Yes (If Docker is in use.)
  415. * **Importance:** Low
  416. * **Type:** String
  417. * **Default:** ray_container
  418. .. _cluster-configuration-pull-before-run:
  419. ``docker.pull_before_run``
  420. ~~~~~~~~~~~~~~~~~~~~~~~~~~
If enabled, the latest version of the image will be pulled when starting Docker. If disabled, ``docker run`` will only pull the image if no cached version is present.
  422. * **Required:** No
  423. * **Importance:** Medium
  424. * **Type:** Boolean
  425. * **Default:** ``True``
  426. .. _cluster-configuration-run-options:
  427. ``docker.run_options``
  428. ~~~~~~~~~~~~~~~~~~~~~~
  429. The extra options to pass to ``docker run``.
  430. * **Required:** No
  431. * **Importance:** Medium
  432. * **Type:** List of String
  433. * **Default:** ``[]``
  434. .. _cluster-configuration-head-run-options:
  435. ``docker.head_run_options``
  436. ~~~~~~~~~~~~~~~~~~~~~~~~~~~
  437. The extra options to pass to ``docker run`` for head node only.
  438. * **Required:** No
  439. * **Importance:** Low
  440. * **Type:** List of String
  441. * **Default:** ``[]``
  442. .. _cluster-configuration-worker-run-options:
  443. ``docker.worker_run_options``
  444. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  445. The extra options to pass to ``docker run`` for worker nodes only.
  446. * **Required:** No
  447. * **Importance:** Low
  448. * **Type:** List of String
  449. * **Default:** ``[]``
  450. .. _cluster-configuration-disable-automatic-runtime-detection:
  451. ``docker.disable_automatic_runtime_detection``
  452. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  453. If enabled, Ray will not try to use the NVIDIA Container Runtime if GPUs are present.
  454. * **Required:** No
  455. * **Importance:** Low
  456. * **Type:** Boolean
  457. * **Default:** ``False``
  458. .. _cluster-configuration-disable-shm-size-detection:
  459. ``docker.disable_shm_size_detection``
  460. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If enabled, Ray will not automatically specify the size of ``/dev/shm`` for the started container, and the runtime's default value (64MiB for Docker) will be used.
  462. If ``--shm-size=<>`` is manually added to ``run_options``, this is *automatically* set to ``True``, meaning that Ray will defer to the user-provided value.
  463. * **Required:** No
  464. * **Importance:** Low
  465. * **Type:** Boolean
  466. * **Default:** ``False``
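A sketch of the manual case described above: passing ``--shm-size`` in ``run_options`` makes Ray defer to the supplied value. The ``2g`` size is an arbitrary illustrative value, not a recommendation.

.. code-block:: yaml

    docker:
        image: rayproject/ray:latest
        container_name: ray_container
        run_options:
            # User-provided shared memory size; automatic detection is skipped.
            - --shm-size=2g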
  467. .. _cluster-configuration-ssh-user:
  468. ``auth.ssh_user``
  469. ~~~~~~~~~~~~~~~~~
  470. The user that Ray will authenticate with when launching new nodes.
  471. * **Required:** Yes
  472. * **Importance:** High
  473. * **Type:** String
  474. .. _cluster-configuration-ssh-private-key:
  475. ``auth.ssh_private_key``
  476. ~~~~~~~~~~~~~~~~~~~~~~~~
  477. .. tabs::
  478. .. group-tab:: AWS
  479. The path to an existing private key for Ray to use. If not configured, Ray will create a new private keypair (default behavior). If configured, the key must be added to the project-wide metadata and ``KeyName`` has to be defined in the :ref:`node configuration <cluster-configuration-node-config>`.
  480. * **Required:** No
  481. * **Importance:** Low
  482. * **Type:** String
  483. .. group-tab:: Azure
  484. The path to an existing private key for Ray to use.
  485. * **Required:** Yes
  486. * **Importance:** High
  487. * **Type:** String
You may use ``ssh-keygen -t rsa -b 4096`` to generate a new SSH key pair.
  489. .. group-tab:: GCP
  490. The path to an existing private key for Ray to use. If not configured, Ray will create a new private keypair (default behavior). If configured, the key must be added to the project-wide metadata and ``KeyName`` has to be defined in the :ref:`node configuration <cluster-configuration-node-config>`.
  491. * **Required:** No
  492. * **Importance:** Low
  493. * **Type:** String
  494. .. _cluster-configuration-ssh-public-key:
  495. ``auth.ssh_public_key``
  496. ~~~~~~~~~~~~~~~~~~~~~~~
  497. .. tabs::
  498. .. group-tab:: AWS
  499. Not available.
  500. .. group-tab:: Azure
  501. The path to an existing public key for Ray to use.
  502. * **Required:** Yes
  503. * **Importance:** High
  504. * **Type:** String
  505. .. group-tab:: GCP
  506. Not available.
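For Azure, where both key paths are required, an ``auth`` section might look like the sketch below. The user name and key paths are assumptions for illustration; point them at your own key pair.

.. code-block:: yaml

    auth:
        ssh_user: ubuntu                    # assumed login user
        ssh_private_key: ~/.ssh/id_rsa      # hypothetical path
        ssh_public_key: ~/.ssh/id_rsa.pub   # hypothetical path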
  507. .. _cluster-configuration-type:
  508. ``provider.type``
  509. ~~~~~~~~~~~~~~~~~
  510. .. tabs::
  511. .. group-tab:: AWS
  512. The cloud service provider. For AWS, this must be set to ``aws``.
  513. * **Required:** Yes
  514. * **Importance:** High
  515. * **Type:** String
  516. .. group-tab:: Azure
  517. The cloud service provider. For Azure, this must be set to ``azure``.
  518. * **Required:** Yes
  519. * **Importance:** High
  520. * **Type:** String
  521. .. group-tab:: GCP
  522. The cloud service provider. For GCP, this must be set to ``gcp``.
  523. * **Required:** Yes
  524. * **Importance:** High
  525. * **Type:** String
  526. .. _cluster-configuration-region:
  527. ``provider.region``
  528. ~~~~~~~~~~~~~~~~~~~
  529. .. tabs::
  530. .. group-tab:: AWS
  531. The region to use for deployment of the Ray cluster.
  532. * **Required:** Yes
  533. * **Importance:** High
  534. * **Type:** String
  535. * **Default:** us-west-2
  536. .. group-tab:: Azure
  537. Not available.
  538. .. group-tab:: GCP
  539. The region to use for deployment of the Ray cluster.
  540. * **Required:** Yes
  541. * **Importance:** High
  542. * **Type:** String
  543. * **Default:** us-west1
  544. .. _cluster-configuration-availability-zone:
  545. ``provider.availability_zone``
  546. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  547. .. tabs::
  548. .. group-tab:: AWS
  549. A string specifying a comma-separated list of availability zone(s) that nodes may be launched in.
  550. * **Required:** No
  551. * **Importance:** Low
  552. * **Type:** String
  553. * **Default:** us-west-2a,us-west-2b
  554. .. group-tab:: Azure
  555. Not available.
  556. .. group-tab:: GCP
  557. A string specifying a comma-separated list of availability zone(s) that nodes may be launched in.
  558. * **Required:** No
  559. * **Importance:** Low
  560. * **Type:** String
  561. * **Default:** us-west1-a
  562. .. _cluster-configuration-location:
  563. ``provider.location``
  564. ~~~~~~~~~~~~~~~~~~~~~
  565. .. tabs::
  566. .. group-tab:: AWS
  567. Not available.
  568. .. group-tab:: Azure
  569. The location to use for deployment of the Ray cluster.
  570. * **Required:** Yes
  571. * **Importance:** High
  572. * **Type:** String
  573. * **Default:** westus2
  574. .. group-tab:: GCP
  575. Not available.
  576. .. _cluster-configuration-resource-group:
  577. ``provider.resource_group``
  578. ~~~~~~~~~~~~~~~~~~~~~~~~~~~
  579. .. tabs::
  580. .. group-tab:: AWS
  581. Not available.
  582. .. group-tab:: Azure
  583. The resource group to use for deployment of the Ray cluster.
  584. * **Required:** Yes
  585. * **Importance:** High
  586. * **Type:** String
  587. * **Default:** ray-cluster
  588. .. group-tab:: GCP
  589. Not available.
  590. .. _cluster-configuration-subscription-id:
  591. ``provider.subscription_id``
  592. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  593. .. tabs::
  594. .. group-tab:: AWS
  595. Not available.
  596. .. group-tab:: Azure
  597. The subscription ID to use for deployment of the Ray cluster. If not specified, Ray will use the default from the Azure CLI.
  598. * **Required:** No
  599. * **Importance:** High
  600. * **Type:** String
  601. * **Default:** ``""``
  602. .. group-tab:: GCP
  603. Not available.
  604. .. _cluster-configuration-project-id:
  605. ``provider.project_id``
  606. ~~~~~~~~~~~~~~~~~~~~~~~
  607. .. tabs::
  608. .. group-tab:: AWS
  609. Not available.
  610. .. group-tab:: Azure
  611. Not available.
  612. .. group-tab:: GCP
  613. The globally unique project ID to use for deployment of the Ray cluster.
  614. * **Required:** No
  615. * **Importance:** Low
  616. * **Type:** String
  617. * **Default:** ``null``
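As a rough illustration for GCP, the provider section might look like the sketch below. The region and zone are example values, and ``project_id`` is left at its documented default of ``null``; replace it with your own project ID as needed.

.. code-block:: yaml

    # Sketch of a GCP provider section (values are illustrative).
    provider:
        type: gcp
        region: us-west1
        availability_zone: us-west1-a
        # Globally unique project ID; null is the documented default.
        project_id: null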
  618. .. _cluster-configuration-cache-stopped-nodes:
  619. ``provider.cache_stopped_nodes``
  620. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  621. If enabled, nodes will be *stopped* when the cluster scales down. If disabled, nodes will be *terminated* instead. Stopped nodes launch faster than terminated nodes.
  622. * **Required:** No
  623. * **Importance:** Low
  624. * **Type:** Boolean
  625. * **Default:** ``True``
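For example, a cluster that should terminate nodes on scale-down rather than stop them could disable the option as sketched here. The AWS provider type and region are assumptions chosen only for illustration.

.. code-block:: yaml

    # Sketch: terminate nodes on scale-down instead of stopping them.
    provider:
        type: aws
        region: us-west-2
        cache_stopped_nodes: False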
  626. .. _cluster-configuration-node-config:
  627. ``available_node_types.<node_type_name>.node_type.node_config``
  628. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  629. The configuration to be used to launch the nodes on the cloud service provider. Among other things, this will specify the instance type to be launched.
  630. * **Required:** Yes
  631. * **Importance:** High
  632. * **Type:** :ref:`Node config <cluster-configuration-node-config-type>`
  633. .. _cluster-configuration-resources:
  634. ``available_node_types.<node_type_name>.node_type.resources``
  635. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  636. The resources that a node type provides, which enable the autoscaler to select the right type of nodes to launch given the resource demands of the application. The specified resources are automatically passed to the ``ray start`` command for the node via an environment variable. If not provided, the autoscaler can detect them automatically only for the AWS and Kubernetes cloud providers. For more information, see the `resource demand scheduler <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/resource_demand_scheduler.py>`_.
  637. * **Required:** Yes (except for AWS/K8s)
  638. * **Importance:** High
  639. * **Type:** :ref:`Resources <cluster-configuration-resources-type>`
  640. * **Default:** ``{}``
  641. In some cases, adding special nodes without any resources may be desirable. Such nodes can be used as a driver that connects to the cluster to launch jobs. To manually add a node to an autoscaled cluster, set the *ray-cluster-name* tag and set the *ray-node-type* tag to unmanaged. Unmanaged nodes can be created by setting the resources to ``{}`` and the :ref:`maximum workers <cluster-configuration-node-max-workers>` to 0. The autoscaler will not attempt to start, stop, or update unmanaged nodes. The user is responsible for properly setting up and cleaning up unmanaged nodes.
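The following sketch shows one regular node type with explicit resources next to a node type declared as unmanaged. The node type names ``cpu_worker`` and ``unmanaged_driver`` are hypothetical, the ``node_config`` contents assume AWS, and the key layout follows the example files shipped with Ray (keys sit directly under the node type name).

.. code-block:: yaml

    # Sketch only: node type names are hypothetical, node_config assumes AWS.
    available_node_types:
        cpu_worker:
            # Explicit resources; on AWS these could also be auto-detected.
            resources: {"CPU": 4}
            node_config:
                InstanceType: m5.xlarge
        unmanaged_driver:
            # Empty resources and zero maximum workers, as described above.
            resources: {}
            max_workers: 0
            node_config: {}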
  642. .. _cluster-configuration-node-min-workers:
  643. ``available_node_types.<node_type_name>.node_type.min_workers``
  644. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  645. The minimum number of workers to maintain for this node type regardless of utilization.
  646. * **Required:** No
  647. * **Importance:** High
  648. * **Type:** Integer
  649. * **Default:** ``0``
  650. * **Minimum:** ``0``
  651. * **Maximum:** Unbounded
  652. .. _cluster-configuration-node-max-workers:
  653. ``available_node_types.<node_type_name>.node_type.max_workers``
  654. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  655. The maximum number of workers to have in the cluster for this node type regardless of utilization. This takes precedence over :ref:`minimum workers <cluster-configuration-node-min-workers>`. By default, the number of workers of a node type is unbounded, constrained only by the cluster-wide :ref:`max_workers <cluster-configuration-max-workers>`.
  656. * **Required:** No
  657. * **Importance:** High
  658. * **Type:** Integer
  659. * **Default:** cluster-wide :ref:`max_workers <cluster-configuration-max-workers>`
  660. * **Minimum:** ``0``
  661. * **Maximum:** cluster-wide :ref:`max_workers <cluster-configuration-max-workers>`
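A minimal sketch of how the per-type limits interact with the cluster-wide limit is shown below. The node type name ``cpu_worker`` is hypothetical and the ``node_config`` contents assume AWS.

.. code-block:: yaml

    # Cluster-wide upper bound on the number of workers.
    max_workers: 10

    available_node_types:
        cpu_worker:
            # Keep at least 2 workers of this type regardless of utilization.
            min_workers: 2
            # Never run more than 8 workers of this type.
            max_workers: 8
            resources: {"CPU": 4}
            node_config:
                InstanceType: m5.xlarge

With these values the autoscaler keeps between 2 and 8 ``cpu_worker`` nodes, and the total across all node types stays at or below 10.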
  662. .. _cluster-configuration-node-type-worker-setup-commands:
  663. ``available_node_types.<node_type_name>.node_type.worker_setup_commands``
  664. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  665. A list of commands to run to set up worker nodes of this type. These commands will replace the general :ref:`worker setup commands <cluster-configuration-worker-setup-commands>` for the node.
  666. * **Required:** No
  667. * **Importance:** Low
  668. * **Type:** List of String
  669. * **Default:** ``[]``
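As an illustration, the sketch below defines general worker setup commands and replaces them for one node type. The node type name ``gpu_worker`` and the specific ``pip`` commands are hypothetical, and the ``node_config`` contents assume AWS.

.. code-block:: yaml

    # General setup commands, used by worker node types without an override.
    worker_setup_commands:
        - pip install -U numpy

    available_node_types:
        gpu_worker:
            # These replace (not extend) the general commands for this type.
            worker_setup_commands:
                - pip install -U torch
            resources: {"CPU": 8, "GPU": 1}
            node_config:
                InstanceType: p3.2xlarge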
  670. .. _cluster-configuration-cpu:
  671. ``available_node_types.<node_type_name>.node_type.resources.CPU``
  672. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  673. .. tabs::
  674. .. group-tab:: AWS
  675. The number of CPUs made available by this node. If not configured, the autoscaler can detect them automatically only for the AWS and Kubernetes cloud providers.
  676. * **Required:** Yes (except for AWS/K8s)
  677. * **Importance:** High
  678. * **Type:** Integer
  679. .. group-tab:: Azure
  680. The number of CPUs made available by this node.
  681. * **Required:** Yes
  682. * **Importance:** High
  683. * **Type:** Integer
  684. .. group-tab:: GCP
  685. The number of CPUs made available by this node.
  686. * **Required:** No
  687. * **Importance:** High
  688. * **Type:** Integer
  689. .. _cluster-configuration-gpu:
  690. ``available_node_types.<node_type_name>.node_type.resources.GPU``
  691. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  692. .. tabs::
  693. .. group-tab:: AWS
  694. The number of GPUs made available by this node. If not configured, the autoscaler can detect them automatically only for the AWS and Kubernetes cloud providers.
  695. * **Required:** No
  696. * **Importance:** Low
  697. * **Type:** Integer
  698. .. group-tab:: Azure
  699. The number of GPUs made available by this node.
  700. * **Required:** No
  701. * **Importance:** High
  702. * **Type:** Integer
  703. .. group-tab:: GCP
  704. The number of GPUs made available by this node.
  705. * **Required:** No
  706. * **Importance:** High
  707. * **Type:** Integer
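The ``CPU`` and ``GPU`` entries are keys of the node type's ``resources`` mapping. The sketch below sets both explicitly; the node type name ``gpu_worker`` is hypothetical and the instance type assumes AWS, where the values could also be left out and detected automatically.

.. code-block:: yaml

    # Sketch only: explicit CPU and GPU counts for one node type.
    available_node_types:
        gpu_worker:
            resources: {"CPU": 8, "GPU": 1}
            node_config:
                InstanceType: p3.2xlarge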
  708. .. _cluster-configuration-node-docker:
  709. ``available_node_types.<node_type_name>.docker``
  710. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  711. A set of overrides to the top-level :ref:`Docker <cluster-configuration-docker>` configuration.
  712. * **Required:** No
  713. * **Importance:** Low
  714. * **Type:** :ref:`docker <cluster-configuration-node-docker-type>`
  715. * **Default:** ``{}``
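A per-node-type override might look like the sketch below. It assumes that the override accepts the ``worker_image`` key of the top-level Docker configuration; the node type name ``gpu_worker`` is hypothetical, and the image names and ``node_config`` contents are illustrative only.

.. code-block:: yaml

    # Top-level Docker configuration, applied to every node by default.
    docker:
        image: "rayproject/ray:latest"
        container_name: "ray_container"

    available_node_types:
        gpu_worker:
            # Assumed override: use a GPU image for this node type only.
            docker:
                worker_image: "rayproject/ray-ml:latest-gpu"
            resources: {"CPU": 8, "GPU": 1}
            node_config:
                InstanceType: p3.2xlarge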
  716. Examples
  717. --------
  718. Minimal configuration
  719. ~~~~~~~~~~~~~~~~~~~~~
  720. .. tabs::
  721. .. group-tab:: AWS
  722. .. literalinclude:: ../../../python/ray/autoscaler/aws/example-minimal.yaml
  723. :language: yaml
  724. .. group-tab:: Azure
  725. .. literalinclude:: ../../../python/ray/autoscaler/azure/example-minimal.yaml
  726. :language: yaml
  727. .. group-tab:: GCP
  728. .. literalinclude:: ../../../python/ray/autoscaler/gcp/example-minimal.yaml
  729. :language: yaml
  730. Full configuration
  731. ~~~~~~~~~~~~~~~~~~
  732. .. tabs::
  733. .. group-tab:: AWS
  734. .. literalinclude:: ../../../python/ray/autoscaler/aws/example-full.yaml
  735. :language: yaml
  736. .. group-tab:: Azure
  737. .. literalinclude:: ../../../python/ray/autoscaler/azure/example-full.yaml
  738. :language: yaml
  739. .. group-tab:: GCP
  740. .. literalinclude:: ../../../python/ray/autoscaler/gcp/example-full.yaml
  741. :language: yaml