Compute Service

Advanced: build a container from scratch in Python

Contact us to request access to the Compute Service.

You can use our CLI to easily create containers for any model written in Python. However, OctoAI can run any container that exposes an HTTP server, so you are always welcome to build containers your own way, with the understanding that custom containers may mean longer cold starts and fewer optimizations.

If you prefer to create your own container from scratch, this tutorial will walk you through one example of how to do so.

  • In this example, we will build a container for a Flan-T5 small model from the Hugging Face transformers library. This model is commonly used for text generation and question answering, but note that because it’s small it does not yield outputs that are as high-quality as other OctoAI LLM endpoints.
  • An equivalent example for Hugging Face diffusers models can be found in the same GitHub repo.


Example code

All the code in this tutorial can be found at this GitHub repo.

Step-by-step walkthrough

Prepare Python code for running an inference

First, we define how to run an inference on this model in a Python module, which the server later imports as model. The core steps are initializing the model and tokenizer using the transformers Python library, then running a predict() function that tokenizes the text input, runs the model, and decodes the model output back into text.
"""Model wrapper for serving flan-t5-small."""
import argparse
import typing

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

_MODEL_NAME = "google/flan-t5-small"
"""The model's name on HuggingFace."""

_DEVICE: str = "cuda:0" if torch.cuda.is_available() else "cpu"
"""Device on which to serve the model."""


class Model:
    """Wrapper for a flan-t5-small Text Generation model."""

    def __init__(self):
        """Initialize the model."""
        self._tokenizer = T5Tokenizer.from_pretrained(_MODEL_NAME)
        self._model = T5ForConditionalGeneration.from_pretrained(_MODEL_NAME).to(
            _DEVICE
        )

    def predict(self, inputs: typing.Dict[str, typing.Any]) -> typing.Dict[str, str]:
        """Return a dict containing the completion of the given prompt.

        :param inputs: dict of inputs containing a prompt and optionally the max length
            of the completion to generate.
        :return: a dict containing the generated completion.
        """
        prompt = inputs.get("prompt", None)
        max_length = inputs.get("max_length", 2048)

        # Tokenize the prompt and move the token IDs to the model's device.
        input_ids = self._tokenizer(prompt, return_tensors="pt").input_ids.to(_DEVICE)
        output = self._model.generate(input_ids, max_length=max_length)
        result = self._tokenizer.decode(output[0], skip_special_tokens=True)

        return {"completion": result}

    @classmethod
    def fetch(cls) -> None:
        """Pre-fetches the model for implicit caching by Transformers."""
        # Running the constructor is enough to fetch this model.
        cls()


def main():
    """Entry point for interacting with this model via CLI."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--fetch", action="store_true")
    args = parser.parse_args()

    if args.fetch:
        Model.fetch()


if __name__ == "__main__":
    main()
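
Note that predict() reads the prompt with a default of None, which the tokenizer would not accept. As a minimal sketch of how a request payload could be checked before it reaches the model, the helper below validates the two fields predict() reads. The name normalize_inputs and the specific error messages are assumptions for illustration and are not part of the example repo.

```python
import typing

_DEFAULT_MAX_LENGTH = 2048  # same default that predict() uses


def normalize_inputs(
    inputs: typing.Optional[typing.Dict[str, typing.Any]],
) -> typing.Tuple[str, int]:
    """Validate a request payload and return (prompt, max_length).

    Hypothetical helper: not part of the original model wrapper.
    """
    if not isinstance(inputs, dict):
        raise ValueError("request body must be a JSON object")

    prompt = inputs.get("prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValueError("'prompt' must be a non-empty string")

    max_length = inputs.get("max_length", _DEFAULT_MAX_LENGTH)
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(max_length, bool) or not isinstance(max_length, int) or max_length < 1:
        raise ValueError("'max_length' must be a positive integer")

    return prompt, max_length
```

One way to use this would be to call it at the top of predict() and let the server translate a ValueError into a 400 response, so malformed requests fail fast instead of surfacing as tokenizer errors.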

Create a server

Next, we wrap this model in a Sanic server. Sanic is a Python 3.7+ web server and web framework that’s written to go fast. In our server file, we define the following:

  • A default port on which to serve inferences. The port can be any valid port number (1 to 65535), as long as it’s not in use by another application. 80 is commonly used for HTTP, and 443 for HTTPS. In this case we choose 8000.
  • Two server routes that OctoAI containers should have:
    • a route for inference requests (e.g. “/predict”). This route must accept JSON inputs and return JSON outputs.
    • a route for health checks (e.g. “/healthcheck”). See Healthcheck path in custom containers for a detailed explanation.
  • Number of workers (not required). Typical best practice is to make this number some function of the number of CPU cores that the server has access to and should use.
"""HTTP Inference serving interface using sanic."""
import os

import model
from sanic import Request, Sanic, response

_DEFAULT_PORT = 8000
"""Default port to serve inference on."""

# Load and initialize the model on startup globally, so it can be reused.
model_instance = model.Model()
"""Global instance of the model to serve."""

server = Sanic("server")
"""Global instance of the web server."""


@server.route("/healthcheck", methods=["GET"])
def healthcheck(_: Request) -> response.JSONResponse:
    """Responds to healthcheck requests.

    :param _: the incoming healthcheck request (unused).
    :return: json responding to the healthcheck.
    """
    return response.json({"healthy": "yes"})


@server.route("/predict", methods=["POST"])
def predict(request: Request) -> response.JSONResponse:
    """Responds to inference/prediction requests.

    :param request: the incoming request containing inputs for the model.
    :return: json containing the inference results.
    """
    inputs = request.json
    output = model_instance.predict(inputs)
    return response.json(output)


def main():
    """Entry point for the server."""
    port = int(os.environ.get("SERVING_PORT", _DEFAULT_PORT))
    # Bind to all interfaces so the port is reachable from outside the container.
    server.run(host="0.0.0.0", port=port, workers=1)


if __name__ == "__main__":
    main()
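
The port and worker settings described in the list above can be resolved from the environment in a few lines. The following is a rough sketch rather than the tutorial's actual code: SERVING_PORT is the variable the server reads, while SERVING_WORKERS and both function names are hypothetical and introduced only for illustration.

```python
import os
import typing


def resolve_port(env: typing.Mapping[str, str], default: int = 8000) -> int:
    """Read SERVING_PORT from env, falling back to the default port."""
    port = int(env.get("SERVING_PORT", default))
    if not 1 <= port <= 65535:
        raise ValueError(f"port {port} is outside the valid range 1-65535")
    return port


def resolve_workers(
    env: typing.Mapping[str, str], cpu_count: typing.Optional[int] = None
) -> int:
    """Pick a worker count from SERVING_WORKERS, else from the CPU count.

    SERVING_WORKERS is a hypothetical variable name used for illustration.
    """
    if "SERVING_WORKERS" in env:
        return max(1, int(env["SERVING_WORKERS"]))
    cores = cpu_count if cpu_count is not None else os.cpu_count()
    # os.cpu_count() may return None; fall back to a single worker.
    return max(1, cores or 1)
```

A caveat on the design choice: because the model is loaded at module import time, each additional worker process would likely load its own copy of the model. For a GPU-backed model it may therefore be reasonable to keep a single worker, as the example server does, even when more CPU cores are available.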

In our toy example, the line model_instance = model.Model() executes first, so by the time the server is instantiated our model is ready. That is why the code in our “/healthcheck” route is very straightforward in this example. In your own container, make sure your “/healthcheck” route returns 200 only after your model is fully loaded and ready to serve inference requests.
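
If your model loads in the background rather than at import time, the healthcheck needs to know whether loading has finished. Below is a framework-agnostic sketch of that idea using only the standard library. The ReadinessGate class, its method names, and the choice of 503 for the not-ready case are assumptions for illustration, not an OctoAI requirement.

```python
import threading
import typing


class ReadinessGate:
    """Tracks whether the model has finished loading (hypothetical helper)."""

    def __init__(self) -> None:
        self._ready = threading.Event()

    def load(self, loader: typing.Callable[[], typing.Any]) -> typing.Any:
        """Run the (possibly slow) loader, then mark the service ready."""
        loaded = loader()
        self._ready.set()
        return loaded

    def healthcheck(self) -> typing.Tuple[int, typing.Dict[str, str]]:
        """Return (HTTP status, JSON body) for a healthcheck request."""
        if self._ready.is_set():
            return 200, {"healthy": "yes"}
        # Any non-200 status tells the caller to keep waiting.
        return 503, {"healthy": "no"}
```

In a Sanic route, the returned pair could be passed along as response.json(body, status=status), with gate.load(model.Model) started from a background thread or startup hook.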

Package the server in a Dockerfile

Now we can package the server by defining a requirements.txt file and a Dockerfile:

FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04

RUN apt update && \
    apt install -y python3-pip

# Upgrade pip and install the copied in requirements.
RUN pip install --no-cache-dir --upgrade pip
ADD requirements.txt requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# Copy in the files necessary to fetch, run and serve the model.
# model.py matches the `import model` in the server; server.py is an assumed
# name for the server file.
ADD model.py .
ADD server.py .

# Fetch the model and cache it locally.
RUN python3 model.py --fetch

# Expose the serving port.
EXPOSE 8000

# Run the server to handle inference requests.
CMD python3 -u server.py

Along with installing the dependencies, the Dockerfile also downloads the model into the image at build time. Because the model isn’t too big, we can cache it in the Docker image for faster startup without impacting the image size too much. If your model is larger, you may want to pull it on container start instead of caching it in the Docker image. This may affect your container startup time, but keeps the image itself slim.
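
For the larger-model case, pulling on container start can be sketched as a small "download once" guard. This is a minimal illustration under stated assumptions: ensure_model, the .complete marker file, and the fetch callback are all hypothetical names, and the actual download step is left to the caller.

```python
import pathlib
import typing


def ensure_model(
    cache_dir: pathlib.Path, fetch: typing.Callable[[pathlib.Path], None]
) -> bool:
    """Download the model into cache_dir unless a previous run already did.

    Returns True if a download happened, False if the cache was reused.
    """
    marker = cache_dir / ".complete"
    if marker.exists():
        return False
    cache_dir.mkdir(parents=True, exist_ok=True)
    fetch(cache_dir)
    # Write the marker last so an interrupted download is retried next start.
    marker.touch()
    return True
```

Here fetch could, for example, wrap a from_pretrained() call that passes cache_dir so the weights land in a mounted volume. If you take this route, the healthcheck should keep reporting not-ready until the download and model load have both finished.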

Build a Docker image using the Dockerfile

$ DOCKER_REGISTRY="XXX" # Put your Docker Hub username here
$ cd ./flan-t5-small
$ docker build -t "$DOCKER_REGISTRY/flan-t5-small-pytorch-sanic" -f Dockerfile .

Test the image locally

Run this Docker image locally on a GPU to test that it can run inferences as expected:

$ docker run --gpus=all -d --rm \
    -p 8000:8000 --env SERVING_PORT=8000 \
    --name "flan-t5-small-pytorch-sanic" \
    "$DOCKER_REGISTRY/flan-t5-small-pytorch-sanic"

Then, in a separate terminal, run the following command one or more times:

$ curl -X GET http://localhost:8000/healthcheck

Repeat until you see {"healthy":"yes"} in the terminal output. Now you can get an inference by running:

$ curl -X POST http://localhost:8000/predict \
    -H "Content-Type: application/json" \
    --data '{"prompt":"What state is Los Angeles in?","max_length":100}'
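
The same local test can be scripted in Python, which may be handy for automated checks. The sketch below uses only the standard library and assumes the container is reachable at http://localhost:8000 as in the docker run command above. The function names, retry counts, and timeouts are illustrative choices, not part of the example repo.

```python
import json
import time
import typing
import urllib.request

BASE_URL = "http://localhost:8000"  # matches the -p 8000:8000 port mapping


def build_predict_request(
    prompt: str, max_length: int = 100, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build the POST request that the curl command above sends."""
    payload = json.dumps({"prompt": prompt, "max_length": max_length}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def wait_until_healthy(
    base_url: str = BASE_URL, attempts: int = 30, delay_s: float = 2.0
) -> bool:
    """Poll /healthcheck until it returns 200 or the attempts run out."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(f"{base_url}/healthcheck", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Covers refused connections, timeouts, and HTTP errors
            # while the container is still starting.
            pass
        time.sleep(delay_s)
    return False


def predict(
    prompt: str, max_length: int = 100, base_url: str = BASE_URL
) -> typing.Dict[str, str]:
    """Send one inference request and return the decoded JSON response."""
    request = build_predict_request(prompt, max_length, base_url)
    with urllib.request.urlopen(request, timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the container running, calling wait_until_healthy() and then predict("What state is Los Angeles in?") should return a dict with a "completion" key, mirroring the two curl commands.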

Push the image to a cloud registry

Push your Docker image to Docker Hub with:

$ docker push "$DOCKER_REGISTRY/flan-t5-small-pytorch-sanic"

Now that your container image is pushed, create an endpoint on OctoAI to serve it.