Libraries: LangChain Python SDK, MoviePy
Models: Nous Hermes 2 Mixtral, Mixtral (in JSON mode), SDXL, SVD 1.1
Date Published: Apr 4, 2024
Publisher: Thierry Moreau

Build a GenAI video generation pipeline

If you want an alternative to Sora, learn how to generate your own one-minute-long videos from simple text prompts using open source models on OctoAI, all for under $3. I started with the idea: name any dish (real or fictional), and the pipeline generates a video showing users how to prepare and cook the dish.

See the Jupyter Notebook

AI generated bowl of doritos chips and dip

Prerequisites

An OctoAI account

You could deploy all the models yourself on the required hardware, but that would be a long and cumbersome task. The simplest way to get started with all these models is to create an OctoAI account, which gives you $10 of free credit upon sign-up.

  1. Sign up or Log in to OctoAI

  2. Create an OctoAI API key

Jupyter notebook

You will need a place to run your code; the easiest option is to launch a Jupyter notebook on your local machine or on Google Colab. See our shared notebook.

  1. Install ImageMagick and the following pip packages:

    1. octoai-sdk

    2. langchain

    3. openai

    4. pillow

    5. ffmpeg

    6. moviepy
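
In a notebook, the package installs can live in a single cell (ImageMagick itself comes from your system package manager):

```python
# Install the Python dependencies from a notebook cell.
# On Colab, ImageMagick can be installed first with: !apt-get install imagemagick
%pip install octoai-sdk langchain openai pillow ffmpeg moviepy
```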

Overview

The idea: name any dish (real or pretend), and a pipeline generates a video showing the ingredients and how to prepare and cook the dish. The videos needed to be high quality and factual, so they would be ready for TikTok, YouTube, or your favorite video platform.

The initial results were pretty good, even for this out of the box recipe of Doritos consommé, which is actually a real dish!

To make this all come together, I needed to build a pipeline of models, each specialized for one part of the outcome. Here are all the models used and their purposes:

  • Nous Hermes 2 Mixtral 8x7B: to generate a recipe from the name of a dish

  • Mixtral 8x7B in JSON mode: to take the recipe and put it into a structured JSON format with specific fields: recipe title, prep time, cooking time, level of difficulty, ingredients, and instruction steps

  • SDXL: to create images of each ingredient, each cooking step, and the final dish

  • Stable Video Diffusion 1.1: to animate each image into a short 4-second video

Lastly, I stitch all the video clips together using MoviePy, and add subtitles and a human-generated soundtrack to the full-length video. Let's get started.

#1 Recipe generation with LangChain

I will use Nous Hermes 2 Mixtral through a popular library, the LangChain Python SDK, and the OctoAI LLM endpoint. Simply add the following to your Python script:

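The import paths below assume a recent langchain/langchain_community release; the SDK reads your OctoAI API key from the OCTOAI_API_TOKEN environment variable.

```python
import os

from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_community.llms.octoai_endpoint import OctoAIEndpoint

# The OctoAI endpoint looks for this token in the environment.
OCTOAI_API_TOKEN = os.environ["OCTOAI_API_TOKEN"]
```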

Next, instantiate your OctoAIEndpoint LLM by passing the Nous Hermes 2 Mixtral model (or another model, if you prefer) and the maximum number of tokens in the model_kwargs dictionary.

Now, define your prompt template. You want to provide enough rules to get the LLM to create a recipe with the right amount of information and detail. This is important because this text feeds the image and video generation steps that follow.

Lastly, instantiate an LLM chain by passing in the LLM and the prompt template you created. The chain is ready to be invoked by passing in the user input: the name of the dish to generate a recipe for. Let's take a look at the code:

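Here is a minimal sketch of those three steps; the exact model identifier string and the prompt wording are assumptions you can adapt:

```python
llm = OctoAIEndpoint(
    endpoint_url="https://text.octoai.run/v1/chat/completions",
    model_kwargs={
        "model": "nous-hermes-2-mixtral-8x7b-dpo",  # assumed model id
        "max_tokens": 1024,
    },
)

# Enough rules that the recipe text can drive the later image/video steps.
template = """You are a world-class chef. Write a recipe for {dish}.
Include a title, prep time, cooking time, difficulty level, an ingredient
list, and numbered instruction steps. Keep each step short and visual."""
prompt = PromptTemplate(template=template, input_variables=["dish"])

chain = LLMChain(llm=llm, prompt=prompt)
recipe_text = chain.invoke({"dish": "Doritos consommé"})["text"]
print(recipe_text)
```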

Let's test this out by providing the dish name "Doritos consommé"; you should get a complete recipe as output.

#2 Structured output formatting (JSON) with OpenAI SDK

Now that we have a recipe, we need to create the associated media (images, videos, captions). The current formatting is not very helpful because it is too difficult to parse, with all the lists, bullets, etc. The best solution is to get the recipe into a JSON object format, which can be processed in a defined way for each detail: ingredients, instructions, and the metadata.

Start by defining a Pydantic class from which to derive a JSON object. The high-level structure should look like the list below (a sketch of the class follows):

  • A dish_name field (string) - name of the dish

  • An ingredient_list field (List[Ingredient]) - lists ingredients. Each Ingredient contains an ingredient field (string) that describes the ingredient and an illustration field (string) that describes a visual per ingredient.

  • A recipe_steps field (List[RecipeStep]) - lists the recipe steps. Each RecipeStep contains a step field (string) which describes the step and an illustration field (string) that describes a visual for that step.

  • A prep_time field (int) - prep time in minutes

  • A cook_time field (int) - cooking time in minutes

  • A difficulty field (string) - difficulty rating of the recipe
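
Using those field names, the Pydantic classes look like this sketch:

```python
from typing import List

from pydantic import BaseModel


class Ingredient(BaseModel):
    ingredient: str    # describes the ingredient
    illustration: str  # describes a visual for the ingredient


class RecipeStep(BaseModel):
    step: str          # describes the step
    illustration: str  # describes a visual for the step


class Recipe(BaseModel):
    dish_name: str
    ingredient_list: List[Ingredient]
    recipe_steps: List[RecipeStep]
    prep_time: int     # minutes
    cook_time: int     # minutes
    difficulty: str
```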

Lots of developers like to use the OpenAI SDK, and because OctoAI's API is compatible with it, we can easily use popular OSS models. You simply need to override OpenAI's base URL and API key, as shown below:

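A sketch of the override; the base URL below is assumed to be OctoAI's OpenAI-compatible text endpoint:

```python
import os

from openai import OpenAI

# Point the OpenAI SDK at OctoAI instead of api.openai.com.
client = OpenAI(
    base_url="https://text.octoai.run/v1",
    api_key=os.environ["OCTOAI_API_TOKEN"],
)
```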

Next, when you instantiate your chat completion instance, set the model to mixtral-8x7b-instruct. Then pass the Recipe Pydantic class defined above as the response format constraint. Here is the code for producing the recipe in JSON mode, using the OpenAI SDK overridden to invoke Mixtral 8x7B:
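
A sketch, assuming OctoAI's JSON mode accepts a JSON schema inside response_format; the system prompt wording is an assumption:

```python
import json

completion = client.chat.completions.create(
    model="mixtral-8x7b-instruct",
    messages=[
        {
            "role": "system",
            "content": "Reformat the recipe below into JSON matching the schema.",
        },
        {"role": "user", "content": recipe_text},
    ],
    # Constrain the output to the Recipe schema defined above.
    response_format={
        "type": "json_object",
        "schema": Recipe.model_json_schema(),
    },
    temperature=0,
)

recipe_dict = json.loads(completion.choices[0].message.content)
```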

Running the code should produce the below JSON output:

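The shape follows the Recipe schema; string values are elided here and the numbers are placeholders rather than actual model output:

```json
{
  "dish_name": "Doritos Consommé",
  "ingredient_list": [
    { "ingredient": "...", "illustration": "..." }
  ],
  "recipe_steps": [
    { "step": "...", "illustration": "..." }
  ],
  "prep_time": 10,
  "cook_time": 30,
  "difficulty": "..."
}
```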

Yes! Now you can generate the needed media: images, videos, and captions.

#3 Generate images with SDXL

The JSON object provides a straightforward way to generate an image for each ingredient and recipe step. To create the images, we are going to use SDXL and invoke it using the OctoAI Python SDK.

Instantiate the OctoAI ImageGenerator with the OctoAI API token, then invoke the generate method for all the images you want to create. Pass the following arguments:

  • engine, selects which model to use (SDXL)

  • prompt, describes the image to create

  • negative_prompt, describes attributes we do not want in the images

  • width, height, specify the resolution of the images

  • sampler, selects the sampler used in every denoising step. Read more here.

  • steps, sets the number of denoising steps per image

  • cfg_scale, specifies the configuration scale, which defines how closely to adhere to the prompt

  • num_images, states the number of images to generate at once

  • use_refiner, when enabled, applies the SDXL refiner model to enhance image quality

  • high_noise_frac, states the ratio of steps to perform on the base model (SDXL) vs. the refiner model

  • style_preset, sets the preset style to apply to both the negative and positive prompts; learn more here.

Learn about all the options for the OctoAI Media Gen API.

Now the code should look something like this:

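A sketch of the loop, assuming the octoai-sdk ImageGenerator client; the negative prompt and parameter values are illustrative defaults:

```python
from octoai.clients.image_gen import ImageGenerator

# Module path and response fields assume the octoai-sdk package.
image_gen = ImageGenerator(token=OCTOAI_API_TOKEN)

items = recipe_dict["ingredient_list"] + recipe_dict["recipe_steps"]
for i, item in enumerate(items):
    response = image_gen.generate(
        engine="sdxl",
        prompt=item["illustration"],
        negative_prompt="blurry, low quality, deformed",  # assumed
        width=1024,
        height=576,
        sampler="DDIM",
        steps=30,
        cfg_scale=12,
        num_images=1,
        use_refiner=True,
        high_noise_frac=0.8,
        style_preset="photographic",
    )
    # Save each still so the animation step can pick it up later.
    response.images[0].to_pil().save(f"image_{i}.png")
```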

Now we should have many still images to work with to create our full video.

AI generated bowl of doritos chips and dip

#4 Animate still images with Stable Video Diffusion

Now we use Stable Video Diffusion 1.1 to animate each of our still images. It is an open source model, and OctoAI and Stability AI's partnership lets you use SVD 1.1 on OctoAI commercially.

To animate the images using the OctoAI Python SDK, instantiate the OctoAI VideoGenerator with your OctoAI API token, then invoke generate for each animation you want to make. Pass the following arguments:

  • engine, selects the model to use (SVD)

  • image, the input image encoded as a base64 string

  • steps, sets the number of denoising steps for each video frame

  • cfg_scale, sets the configuration scale, which defines how closely to adhere to the image description

  • fps, sets the number of frames per second

  • motion_scale, sets how much motion to include in the animation

  • noise_aug_strength, sets how much noise to add to the initial images; a higher value produces more creative results

  • num_video, states how many animation outputs to make

Learn more about the API and other options supported by OctoAI.

Let's take a look at the code for this:

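A sketch, assuming the octoai-sdk VideoGenerator client and a base64-encoded MP4 in the response; parameter values are illustrative:

```python
import base64

from octoai.clients.video_gen import VideoGenerator

video_gen = VideoGenerator(token=OCTOAI_API_TOKEN)

for i in range(len(items)):
    with open(f"image_{i}.png", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = video_gen.generate(
        engine="svd",
        image=image_b64,
        steps=25,
        cfg_scale=3,
        fps=6,  # 25 frames at 6 FPS is roughly 4.2 seconds
        motion_scale=0.5,
        noise_aug_strength=0.02,
        num_video=1,
    )

    # Decode and save the animation alongside its source image.
    with open(f"animation_{i}.mp4", "wb") as f:
        f.write(base64.b64decode(response.videos[0].video))
```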

It takes about 30 seconds to create each 4-second animation. Since each video is created sequentially, this might take a few minutes to complete. You can simply extend this code to make it asynchronous or parallelize it.

#5 Create a full length video with MoviePy

Using the MoviePy library we can make a montage of the videos.

For every animation, we have corresponding text that goes with it from the recipe_dict JSON object. So, we can use this to create a montage of captions.

Now to put it all together in a user-friendly way. Each animation has 25 frames at 6 FPS, lasting about 4.167 seconds. But our ingredient list can get long, so we trim the ingredient clips to 2 seconds to keep the overall video moving. For the steps portion of the video we play the full 4 seconds of each animation, because the user needs time to read the directions.

The code below does these 3 things:

  • Stitches together the animations in this order: final dish, all ingredients, all instructions, and ending with final dish being cooked.

  • Adds subtitles throughout the video so the instructions are easy to follow

  • Adds a soundtrack to the video to delight users

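A sketch of the montage; the clip file names, caption styling, and soundtrack file are assumptions (TextClip captions require the ImageMagick install from the prerequisites):

```python
from moviepy.editor import (
    AudioFileClip,
    CompositeVideoClip,
    TextClip,
    VideoFileClip,
    concatenate_videoclips,
)

def captioned(path, text, duration):
    """Trim a clip to `duration` seconds and overlay a subtitle."""
    clip = VideoFileClip(path).subclip(0, duration)
    caption = (
        TextClip(text, fontsize=36, color="white", method="caption",
                 size=(int(clip.w * 0.9), None))
        .set_position(("center", "bottom"))
        .set_duration(duration)
    )
    return CompositeVideoClip([clip, caption])

# Final dish, 2-second ingredient clips, 4-second step clips, final dish.
clips = [captioned("animation_dish.mp4", recipe_dict["dish_name"], 4)]
for i, ing in enumerate(recipe_dict["ingredient_list"]):
    clips.append(captioned(f"animation_{i}.mp4", ing["ingredient"], 2))
offset = len(recipe_dict["ingredient_list"])
for j, step in enumerate(recipe_dict["recipe_steps"]):
    clips.append(captioned(f"animation_{offset + j}.mp4", step["step"], 4))
clips.append(captioned("animation_dish.mp4", "Enjoy!", 4))

# Stitch the clips, lay the soundtrack under them, and render the result.
video = concatenate_videoclips(clips)
soundtrack = AudioFileClip("soundtrack.mp3").subclip(0, video.duration)
video = video.set_audio(soundtrack)
video.write_videofile("collage_sub_sound.mp4", fps=6)
```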

Open "collage_sub_sound.mp4" to see the full video on how to make Doritos consommé. Check out our video of "skittles omelette" too.

GenAI still image from skittles omelette pretend recipe

Conclusion

Sora is not yet mainstream, but if you get creative you already have the building blocks to create highly usable GenAI media today.

How did your video turn out? What recipe did you try to create? Feel free to show us in Discord. We look forward to seeing what you build.