.. _ray-joblib:

Distributed Scikit-learn / Joblib
=================================

.. _`issue on GitHub`: https://github.com/ray-project/ray/issues

Ray supports running distributed `scikit-learn`_ programs by
implementing a Ray backend for `joblib`_ using `Ray Actors <actors.html>`__
instead of local processes. This makes it easy to scale existing applications
that use scikit-learn from a single node to a cluster.

.. note::

    This API is new and may be revised in future Ray releases. If you encounter
    any bugs, please file an `issue on GitHub`_.

.. _`joblib`: https://joblib.readthedocs.io
.. _`scikit-learn`: https://scikit-learn.org

Quickstart
----------

To get started, first `install Ray <installation.html>`__, then use
``from ray.util.joblib import register_ray`` and run ``register_ray()``.
This will register Ray as a joblib backend for scikit-learn to use.
Then run your original scikit-learn code inside
``with joblib.parallel_backend('ray')``. This will start a local Ray cluster.
See the `Run on a Cluster`_ section below for instructions to run on
a multi-node Ray cluster instead.

.. code-block:: python

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.svm import SVC

    digits = load_digits()
    param_space = {
        'C': np.logspace(-6, 6, 30),
        'gamma': np.logspace(-8, 8, 30),
        'tol': np.logspace(-4, -1, 30),
        'class_weight': [None, 'balanced'],
    }
    model = SVC(kernel='rbf')
    search = RandomizedSearchCV(model, param_space, cv=5, n_iter=300, verbose=10)

    import joblib
    from ray.util.joblib import register_ray
    register_ray()
    with joblib.parallel_backend('ray'):
        search.fit(digits.data, digits.target)

You can also set the ``ray_remote_args`` argument in ``parallel_backend`` to :func:`configure
the Ray Actors <ray.remote>` making up the Pool. This can be used, e.g., to :ref:`assign resources
to Actors, such as GPUs <actor-resource-guide>`.

.. code-block:: python

    # Allows using GPU-enabled estimators, such as cuML
    with joblib.parallel_backend('ray', ray_remote_args=dict(num_gpus=1)):
        search.fit(digits.data, digits.target)

Run on a Cluster
----------------

This section assumes that you have a running Ray cluster. To start a Ray cluster,
please refer to the `cluster setup <cluster/index.html>`__ instructions.

To connect a scikit-learn program to a running Ray cluster, you have to specify the address of the
head node by setting the ``RAY_ADDRESS`` environment variable.

You can also start Ray manually by calling ``ray.init()`` (with any of its supported
configuration options) before calling ``with joblib.parallel_backend('ray')``.

.. warning::

    If you do not set the ``RAY_ADDRESS`` environment variable and do not provide
    ``address`` in ``ray.init(address=<address>)``, then scikit-learn will run on a SINGLE node!

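As a rough sketch, the manual ``ray.init()`` route described above could look like the
following. It reuses the ``search`` and ``digits`` objects from the quickstart example and
assumes a cluster has already been started on this machine (for example with
``ray start --head``); ``address="auto"`` attaches to that existing cluster instead of
starting a new single-node one.

.. code-block:: python

    import joblib
    import ray
    from ray.util.joblib import register_ray

    # Attach to the already-running Ray cluster rather than starting a new
    # single-node instance. Alternatively, set the RAY_ADDRESS environment
    # variable and let ray.init() pick the address up from the environment.
    ray.init(address="auto")

    register_ray()

    # scikit-learn code inside this block is distributed across the cluster.
    with joblib.parallel_backend('ray'):
        search.fit(digits.data, digits.target)

If you prefer the environment-variable route, set ``RAY_ADDRESS`` before launching the
script and skip the explicit ``ray.init()`` call.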