(serve-set-up-fastapi-http)=
# Set Up FastAPI and HTTP

This section helps you understand how to expose your models over HTTP with Serve.

Serve offers a layered approach to expose your model with the right HTTP API.
Considering your use case, you can choose the right level of abstraction:

- If you are comfortable working with the raw request object, use the [`starlette.request.Request` API](serve-http).
- If you want a fully fledged API server with validation and doc generation, use the [FastAPI integration](serve-fastapi-http).
- If you just want a pre-defined HTTP schema, use the [`DAGDriver`](serve-http-adapters) with an `http_adapter`.
(serve-http)=
## Calling Deployments via HTTP

When you deploy a Serve application, the ingress deployment (the one passed to `serve.run`) will be exposed over HTTP.

```{literalinclude} doc_code/http_guide.py
:start-after: __begin_starlette__
:end-before: __end_starlette__
:language: python
```
Requests to the Serve HTTP server at `/` are routed to the deployment's `__call__` method with a [Starlette Request object](https://www.starlette.io/requests/) as the sole argument. The `__call__` method can return any JSON-serializable object or a [Starlette Response object](https://www.starlette.io/responses/) (e.g., to return a custom status code or custom headers).
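To make the `__call__` contract concrete, here is a small self-contained sketch. `FakeRequest` is a made-up stand-in for `starlette.requests.Request`, used only so the snippet runs without a server; in a real app the class would be decorated with `@serve.deployment`.

```python
import asyncio
import json


class FakeRequest:
    """Illustrative stand-in for starlette.requests.Request (not a Serve API)."""

    def __init__(self, body: bytes):
        self._body = body

    async def json(self):
        return json.loads(self._body)


class Hello:
    # In a real Serve app this class would be decorated with @serve.deployment,
    # and Serve would invoke __call__ with a real Starlette Request.
    async def __call__(self, request) -> dict:
        data = await request.json()
        # Return any JSON-serializable object; a Starlette Response could be
        # returned instead for a custom status code or custom headers.
        return {"greeting": f"Hello {data['name']}!"}


result = asyncio.run(Hello()(FakeRequest(b'{"name": "Serve"}')))
print(result)  # {'greeting': 'Hello Serve!'}
```

The same handler body works unchanged inside a real deployment; only the decorator and the way the request arrives differ.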
Often for ML models, you just need the API to accept a `numpy` array. You can use Serve's `DAGDriver` to simplify the request parsing.
```{literalinclude} doc_code/http_guide.py
:start-after: __begin_dagdriver__
:end-before: __end_dagdriver__
:language: python
```
Serve provides a library of HTTP adapters to help you avoid boilerplate code. The [later section](serve-http-adapters) dives deeper into how these work.
(serve-fastapi-http)=
## FastAPI HTTP Deployments

If you want to define more complex HTTP handling logic, Serve integrates with FastAPI. This allows you to define a Serve deployment using the {mod}`@serve.ingress <ray.serve.api.ingress>` decorator that wraps a FastAPI app with its full range of features. The most basic example of this is shown below, but for more details on all that FastAPI has to offer such as variable routes, automatic type validation, dependency injection (e.g., for database connections), and more, please check out [their documentation](https://fastapi.tiangolo.com/).

```{literalinclude} doc_code/http_guide.py
:start-after: __begin_fastapi__
:end-before: __end_fastapi__
:language: python
```
Now if you send a request to `/hello`, this will be routed to the `root` method of our deployment. We can also easily leverage FastAPI to define multiple routes with different HTTP methods:
```{literalinclude} doc_code/http_guide.py
:start-after: __begin_fastapi_multi_routes__
:end-before: __end_fastapi_multi_routes__
:language: python
```
You can also pass in an existing FastAPI app to a deployment to serve it as-is:
```{literalinclude} doc_code/http_guide.py
:start-after: __begin_byo_fastapi__
:end-before: __end_byo_fastapi__
:language: python
```
This is useful for scaling out an existing FastAPI app with no modifications necessary.
Existing middlewares, **automatic OpenAPI documentation generation**, and other advanced FastAPI features should work as-is.
```{note}
Serve currently does not support WebSockets. If you have a use case that requires it, please [let us know](https://github.com/ray-project/ray/issues/new/choose)!
```
(serve-http-streaming-response)=
## Streaming Responses

```{warning}
Support for HTTP streaming responses is experimental. To enable this feature, set `RAY_SERVE_ENABLE_EXPERIMENTAL_STREAMING=1` on the cluster before starting Ray. If you encounter any issues, [file an issue on GitHub](https://github.com/ray-project/ray/issues/new/choose).
```
Some applications must stream incremental results back to the caller. This is common for text generation using large language models (LLMs) or video processing applications. The full forward pass may take multiple seconds, so providing incremental results as they're available provides a much better user experience.
To use HTTP response streaming, return a [StreamingResponse](https://www.starlette.io/responses/#streamingresponse) that wraps a generator from your HTTP handler. This is supported for basic HTTP ingress deployments using a `__call__` method and when using the [FastAPI integration](serve-fastapi-http).

The code below defines a Serve application that incrementally streams numbers up to a provided `max`. The client-side code is also updated to handle the streaming outputs. This code uses the `stream=True` option to the [requests](https://requests.readthedocs.io/en/latest/user/advanced/#streaming-requests) library.
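As a rough, server-free sketch of the pattern: the generator below plays the role of the one you would wrap in a `StreamingResponse`, and the consuming loop mimics how a client with `stream=True` handles each chunk as it arrives. Names and timings here are illustrative, not part of the Serve API.

```python
import time


def number_generator(max: int):
    # On the server, a generator like this would be wrapped in a Starlette
    # StreamingResponse so each item is sent to the client as it's produced.
    for i in range(max):
        yield str(i)
        time.sleep(0.01)  # simulate per-item work (e.g., one LLM token)


# On the client, each chunk is handled as soon as it arrives rather than
# waiting for the full response (requests' stream=True behaves similarly).
received = []
for chunk in number_generator(5):
    received.append(chunk)

print(received)  # ['0', '1', '2', '3', '4']
```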
```{literalinclude} doc_code/streaming_example.py
:start-after: __begin_example__
:end-before: __end_example__
:language: python
```
Save this code in `stream.py` and run it:
```bash
$ RAY_SERVE_ENABLE_EXPERIMENTAL_STREAMING=1 python stream.py
[2023-05-25 10:44:23] INFO ray._private.worker::Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
(ServeController pid=40401) INFO 2023-05-25 10:44:25,296 controller 40401 deployment_state.py:1259 - Deploying new version of deployment default_StreamingResponder.
(HTTPProxyActor pid=40403) INFO: Started server process [40403]
(ServeController pid=40401) INFO 2023-05-25 10:44:25,333 controller 40401 deployment_state.py:1498 - Adding 1 replica to deployment default_StreamingResponder.
Got result 0.0s after start: '0'
Got result 0.1s after start: '1'
Got result 0.2s after start: '2'
Got result 0.3s after start: '3'
Got result 0.4s after start: '4'
Got result 0.5s after start: '5'
Got result 0.6s after start: '6'
Got result 0.7s after start: '7'
Got result 0.8s after start: '8'
Got result 0.9s after start: '9'
(ServeReplica:default_StreamingResponder pid=41052) INFO 2023-05-25 10:49:52,230 default_StreamingResponder default_StreamingResponder#qlZFCa yomKnJifNJ / default replica.py:634 - __CALL__ OK 1017.6ms
```
(serve-http-adapters)=
## HTTP Adapters

HTTP adapters are functions that convert raw HTTP requests to basic Python types that you know and recognize.
For example, here is an adapter that extracts the JSON content from a request:
```python
import starlette.requests


async def json_resolver(request: starlette.requests.Request):
    return await request.json()
```
The input arguments to an HTTP adapter should be type-annotated. At a minimum, the adapter should accept a [`starlette.requests.Request`](https://www.starlette.io/requests/#request) type, but it can also accept any type that's recognized by FastAPI's dependency injection framework.
Here is an HTTP adapter that accepts two HTTP query parameters:
```python
def parse_query_args(field_a: int, field_b: str):
    return YourDataClass(field_a, field_b)
```
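A runnable version of this pattern, with a made-up dataclass standing in for `YourDataClass`:

```python
from dataclasses import dataclass


@dataclass
class QueryArgs:
    """Hypothetical stand-in for YourDataClass."""

    field_a: int
    field_b: str


def parse_query_args(field_a: int, field_b: str) -> QueryArgs:
    # FastAPI's dependency injection would populate these arguments from
    # the query string, e.g. /endpoint?field_a=1&field_b=x
    return QueryArgs(field_a, field_b)


args = parse_query_args(1, "x")
print(args)  # QueryArgs(field_a=1, field_b='x')
```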
You can specify different type signatures to facilitate the extraction of HTTP fields. For more details, take a look at the [FastAPI documentation](https://fastapi.tiangolo.com/tutorial/dependencies/).

In addition to writing your own adapters, you can use Serve's built-in adapters. Below we examine three ways to use them:

- with the Ray AIR `Predictor` deployment
- with the deployment graph `DAGDriver`
- embedded in your own FastAPI application

### Ray AIR `Predictor`
Ray Serve provides a suite of adapters to convert HTTP requests to ML inputs like `numpy` arrays. You can use them together with the Ray AI Runtime (AIR) model wrapper feature to one-click deploy pre-trained models. As an example, we provide a simple adapter for an n-dimensional array.

When using model wrappers, you can specify your HTTP adapter via the `http_adapter` field:
```python
from ray import serve
from ray.serve import PredictorDeployment
from ray.serve.http_adapters import json_to_ndarray

serve.run(PredictorDeployment.options(name="my_model").bind(
    my_ray_air_predictor,
    my_ray_air_checkpoint,
    http_adapter=json_to_ndarray,
))
```
:::{note}
`my_ray_air_predictor` and `my_ray_air_checkpoint` are the first two arguments to the `PredictorDeployment` constructor. For detailed usage, please check out the Ray AI Runtime (AIR) model wrapper documentation.
:::
### `DAGDriver`

When using a Serve deployment graph, you can configure `ray.serve.drivers.DAGDriver` to accept an HTTP adapter via its `http_adapter` field. For example, the `json_request` adapter parses JSON in the HTTP body:
```python
from ray.serve.drivers import DAGDriver
from ray.serve.http_adapters import json_request
from ray.dag.input_node import InputNode

with InputNode() as input_node:
    # ...
    dag = DAGDriver.bind(other_node, http_adapter=json_request)
```
### FastAPI Application

You can also bring the adapter to your own FastAPI app using [Depends](https://fastapi.tiangolo.com/tutorial/dependencies/#import-depends). The input schema automatically becomes part of the generated OpenAPI schema with FastAPI.
```python
from fastapi import FastAPI, Depends

from ray.serve.http_adapters import json_to_ndarray

app = FastAPI()


@app.post("/endpoint")
async def endpoint(np_array=Depends(json_to_ndarray)):
    ...
```
Serve also supports pydantic models as a shorthand for HTTP adapters in model wrappers. Instead of using a function to define your HTTP adapter as in the examples above, you can directly pass in a pydantic model class to effectively tell Ray Serve to validate the HTTP body with this schema. Once validated, the model instance will be passed to the predictor.
```python
from pydantic import BaseModel


class User(BaseModel):
    user_id: int
    user_name: str


# ...
PredictorDeployment.deploy(..., http_adapter=User)
# Or:
DAGDriver.bind(other_node, http_adapter=User)
```
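To illustrate what this validation does, here is a plain pydantic sketch, independent of Serve (assuming pydantic is installed):

```python
from pydantic import BaseModel, ValidationError


class User(BaseModel):
    user_id: int
    user_name: str


# A well-formed body is parsed into a model instance; pydantic also
# coerces compatible values, such as a numeric string into an int.
user = User(user_id="42", user_name="alice")
print(user.user_id)  # 42

# A malformed body raises ValidationError, which surfaces as an HTTP
# error response before your predictor is ever called.
try:
    User(user_id="not-a-number", user_name="bob")
    rejected = False
except ValidationError:
    rejected = True
print(rejected)  # True
```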
Here is a list of built-in adapters; please feel free to contribute more!

(serve-ndarray-schema)=

```{eval-rst}
.. automodule:: ray.serve.http_adapters
    :members: json_to_ndarray, image_to_ndarray, starlette_request, json_request, pandas_read_json, json_to_multi_ndarray
```