Advanced: build a container from scratch in Python
Contact us at customer-success@octo.ai to request access to the Compute Service.
You can use our CLI to easily create containers for any model written in Python. However, note that OctoAI is able to run any container with an HTTP server, so you are always welcome to build containers in your own way (with the understanding that using custom containers may mean longer cold starts and fewer optimizations).
If you prefer to create your own container from scratch, this tutorial will walk you through one example of how to do so.
- In this example, we will build a container for a Flan-T5 small model from the Hugging Face `transformers` library. This model is commonly used for text generation and question answering, but note that because it's small, it does not yield outputs that are as high-quality as other OctoAI LLM endpoints.
- An equivalent example for Hugging Face `diffusers` models can be found in the same GitHub repo.
Prerequisites
- Sign up for a Docker Hub account
- Download Docker desktop on your local machine
- Authenticate the Docker CLI on your machine
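For the last step, you can authenticate from a terminal with the standard Docker command (you will be prompted for your Docker Hub credentials):

```bash
docker login
```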
Example code
All the code in this tutorial can be found at this GitHub repo.
Step-by-step walkthrough
Prepare Python code for running an inference
First, we define how to run an inference on this model in `model.py`. The core steps include initializing the model and tokenizer using the `transformers` Python library, then running a `predict()` function that tokenizes the text input, runs the model, and decodes the model output back into text.
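A minimal sketch of what `model.py` might look like is shown below. The class name `Model`, the `predict()` signature, and the generation settings are assumptions for illustration; see the GitHub repo for the exact code.

```python
# model.py: sketch of a Flan-T5 small wrapper (names and settings are illustrative)
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "google/flan-t5-small"

class Model:
    def __init__(self):
        # Load the tokenizer and weights once, at startup.
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
        self.model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME).to(self.device)

    def predict(self, prompt: str) -> str:
        # Tokenize the input, run generation, then decode back into text.
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
        output_ids = self.model.generate(**inputs, max_new_tokens=128)
        return self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
```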
Create a server
Next, we wrap this model in a Sanic server in `server.py`. Sanic is a Python 3.7+ web server and web framework that's written to go fast. In our server file, we define the following:
- A default port on which to serve inferences. The port can be any valid port number, as long as it's not already in use by another application. 80 is commonly used for HTTP and 443 for HTTPS; in this case we choose 8000.
- Two server routes that OctoAI containers should have:
  - a route for inference requests (e.g. `/predict`). This route must accept JSON inputs and return JSON outputs.
  - a route for health checks (e.g. `/healthcheck`). See Healthcheck path in custom containers for a detailed explanation.
- Number of workers (not required). Typical best practice is to make this number a function of the number of CPU cores that the server has access to and should use.
In our toy example, the line `model_instance = model.Model()` executes first, so by the time the server is instantiated, our model is ready. That is why the code in our `/healthcheck` route is very straightforward in this example. In your own container, make sure your `/healthcheck` route returns 200 only after your model is fully loaded and ready to take inferences.
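A sketch of what `server.py` might look like is below. The route names match the examples above, but the request/response field names (`prompt`, `completion`) and the worker count are illustrative assumptions; the code in the GitHub repo may differ.

```python
# server.py: sketch of the Sanic server described above (field names are illustrative)
from sanic import Sanic
from sanic.response import json as sanic_json

import model

app = Sanic("flan_t5_server")

# Load the model before the server starts accepting requests, so that
# /healthcheck only returns 200 once the model is ready.
model_instance = model.Model()

@app.route("/healthcheck", methods=["GET"])
async def healthcheck(request):
    return sanic_json({"healthy": True})

@app.route("/predict", methods=["POST"])
async def predict(request):
    # Expect a JSON body such as {"prompt": "..."} and return JSON output.
    prompt = request.json["prompt"]
    completion = model_instance.predict(prompt)
    return sanic_json({"completion": completion})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000, workers=1)
```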
Package the server in a Dockerfile
Now we can package the server by defining a requirements.txt file and a Dockerfile:
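For reference, the two files might look roughly like the following. The pinned dependencies, base image, and file layout are assumptions; consult the GitHub repo for the exact contents.

```text
# requirements.txt (illustrative)
sanic
torch
transformers
sentencepiece
```

```dockerfile
# Dockerfile: sketch of the packaging step (base image and paths are illustrative)
FROM python:3.10-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Download the Flan-T5 small weights at build time so they are cached in the image.
RUN python -c "from transformers import AutoTokenizer, AutoModelForSeq2SeqLM; \
    AutoTokenizer.from_pretrained('google/flan-t5-small'); \
    AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-small')"

COPY model.py server.py ./

EXPOSE 8000
CMD ["python", "server.py"]
```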
Along with installing the dependencies, the Dockerfile also downloads the model into the image at build time. Because the model isn’t too big, we can cache it in the Docker image for faster startup without impacting the image size too much. If your model is larger, you may want to pull it on container start instead of caching it in the Docker image. This may affect your container startup time, but keeps the image itself slim.
Build a Docker image using the Dockerfile
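For example (the image name and tag below are placeholders; use your own Docker Hub username):

```bash
docker build -t <your-dockerhub-username>/flan-t5-small:v1 .
```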
Test the image locally
Run this Docker image locally on a GPU to test that it can run inferences as expected:
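For example, assuming the port 8000 chosen above and that the NVIDIA Container Toolkit is installed so `--gpus all` works:

```bash
docker run --rm --gpus all -p 8000:8000 <your-dockerhub-username>/flan-t5-small:v1
```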
…and in a separate terminal, run the healthcheck command one or more times until the model is loaded.
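For example, assuming the port and route names from the server sketch above:

```bash
curl http://localhost:8000/healthcheck
```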
Once you see `{"healthy":true}` in the terminal output, the model is ready. Now, you can get an inference by running a command along the lines of:
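The JSON field name below (`prompt`) follows the server sketch above; adjust it to your server's actual schema.

```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is the capital of France?"}'
```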
Push the image to a cloud registry
Push your Docker image to Docker Hub with:
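For example, using the placeholder image name from the build step:

```bash
docker push <your-dockerhub-username>/flan-t5-small:v1
```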
Now that you have your container, create an endpoint to serve it on OctoAI.