Fine-tuning Stable Diffusion
Create custom assets with OctoAI's fine-tuning of Stable Diffusion models.
OctoAI lets you fine-tune Stable Diffusion to customize generated images. Fine-tuning is the process of training a model on additional data for your task. There are a few simple steps:
- Configure fine-tuning settings & upload your images
- Run the fine-tuning job
- Use the fine-tuned asset (a LoRA) in your image generation requests
We’re using the LoRA fine-tuning method, short for Low-Rank Adaptation. It’s a fast and effective way to fine-tune Stable Diffusion. Fine-tuning is supported for Stable Diffusion v1.5 and Stable Diffusion XL, and is available in both the OctoAI web UI and the fine-tuning API.
Web UI Guide
In the web UI, navigate to the Tuning & Datasets page from the Media Gen Solution menu to get started - any previously tuned models will also be listed here. Click on “New Tune” to continue.
Configure settings
Specify the name of your fine-tune, the trigger word of the subject you’re fine-tuning, and the base checkpoint. The base checkpoint can be the default Stable Diffusion v1.5 checkpoint, default Stable Diffusion XL checkpoint, or any custom checkpoint.
The trigger word can be used in your inference requests to customize the images with your subject. We generally recommend using a unique trigger word, such as “sks1”, that’s unlikely to be associated with a different subject in Stable Diffusion. Alternatively, you can use an existing concept as the trigger word value - such as “in the style of a cartoon drawing” - to update Stable Diffusion’s understanding of that concept.
Then, specify the number of steps to train. A range of 400 to 1,200 steps works well in most cases, and a good guideline is about 75 to 100 steps per training image - for example, roughly 800 steps for 10 training images. The model can underfit if the number of training steps is too low, resulting in poor quality. If it’s too high, the model can overfit and struggle to generate anything that isn’t represented in the training images.
Upload images & tune
Next, upload your training images and start tuning. We recommend using varied images, including different backgrounds, lightings, and distances. Finding a balance between variation and consistency can help improve image generation quality. All uploaded images used for fine-tuning must comply with our terms of service.
Optionally, you can provide captions for each image that describe the custom subject. This can help improve fine-tuning and the quality of generated images. Make sure to include your trigger word in the caption.
When you’re ready, click “Start Tuning”, and the fine-tune job will progress from pending to running before completing.
Generating images
When complete, the fine-tuned asset is stored in your Asset Library and available for image generation. You can launch the Text to Image or Image to Image tool to start generating images with your custom asset.
API Guide
Complete fine-tuning API parameters are organized in our API Reference documentation.
Upload images
First, upload your training images using the AssetLibrary Python Client or CLI.
Python client
You can easily upload individual image files, or a folder with multiple files. Here’s an example uploading the `image1.jpeg` file with the name `image1` from the file path `finetuning_images`:
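Below is a minimal sketch of that upload using the SDK’s `AssetOrchestrator`. The import path and `create()` signature reflect our best understanding of the Python Client, so verify them against the client reference:

```python
# Sketch of a single-file upload; the AssetOrchestrator import path and
# create() signature are assumptions -- check the Python Client reference.
from octoai.clients.asset_orch import AssetOrchestrator, FileData

# Reads the OCTOAI_TOKEN environment variable for authentication
asset_orch = AssetOrchestrator()

asset = asset_orch.create(
    file="finetuning_images/image1.jpeg",  # local file path
    data=FileData(file_format="jpeg"),     # image format
    name="image1",                         # asset name in the Asset Library
)
print(asset)  # prints the asset ID, name, and status
```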
You’ll receive a response with the asset ID, name, and status.
Here’s an example uploading a folder of images. This code snippet gets the files in the folder named `finetuning_images`, then splits each file name on the `.` to get the `file_format` extension (jpg, jpeg, or png). The file names are then used to set the asset names:
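A sketch of that folder upload, under the same assumptions about the `AssetOrchestrator` interface as above:

```python
# Sketch of a folder upload; splits each file name on "." to derive the
# asset name and file format, as described above.
import os

from octoai.clients.asset_orch import AssetOrchestrator, FileData

asset_orch = AssetOrchestrator()
dir_path = "finetuning_images"

for file_name in os.listdir(dir_path):
    # "image1.jpeg" -> asset name "image1", file format "jpeg"
    asset_name, file_format = file_name.rsplit(".", 1)
    asset = asset_orch.create(
        file=os.path.join(dir_path, file_name),
        data=FileData(file_format=file_format),
        name=asset_name,
    )
    print(asset)
```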
The final `print(asset)` call returns a response with each asset ID, name, and status.
CLI
Alternatively, you can upload images using the OctoAI CLI. Here’s the CLI command for the same `image1.jpeg` example:
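A sketch of that command; the subcommand and flag names here are assumptions on our part, so run `octoai --help` to confirm the exact syntax for your CLI version:

```bash
# Illustrative only -- flag names are assumptions; see `octoai --help`.
octoai asset create \
  --file finetuning_images/image1.jpeg \
  --name image1
```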
Configure settings & tune
Next, create your fine-tune. The examples in this section fine-tune Stable Diffusion XL with images of a bulldog. We specify the base checkpoint, trigger word, training steps, and fine-tune name, along with the individual training images and their corresponding captions. We recommend including captions that describe the context of the subject to improve fine-tuning quality, and be sure to include your trigger word within each caption.
Python Client
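The sketch below shows the general shape of a tune-creation request with the Python Client. The method and argument names are assumptions on our part; confirm the exact interface in the API Reference:

```python
# Sketch only: the tune-creation method and argument names are assumptions;
# see the API Reference for the exact Python Client interface.
from octoai.client import Client

client = Client()  # reads OCTOAI_TOKEN from the environment

tune = client.tune.create(
    name="sks1-bulldog-tune",
    description="LoRA fine-tune of SDXL on bulldog images",
    base_checkpoint="octoai:default-sdxl",  # default Stable Diffusion XL checkpoint
    trigger_words=["sks1"],
    steps=800,
    files=[
        # Asset IDs from the upload step, each with an optional caption
        {"file_id": "<asset-id-1>", "caption": "photo of sks1 bulldog at the beach"},
        {"file_id": "<asset-id-2>", "caption": "photo of sks1 bulldog in a park"},
    ],
)
print(tune)  # includes the tune ID and status
```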
REST API
Full API details and parameters are available in the API Reference. Using the `continue_on_rejection` boolean parameter, you can optionally continue the fine-tune job even if any of the training images are identified as NSFW.
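Here’s a sketch of the equivalent REST call using Python’s `requests`; the endpoint path and payload field names are assumptions that should be checked against the API Reference:

```python
# Sketch of the REST request; endpoint and field names are assumptions.
import os
import requests

resp = requests.post(
    "https://api.octoai.cloud/v1/tune",
    headers={"Authorization": f"Bearer {os.environ['OCTOAI_TOKEN']}"},
    json={
        "name": "sks1-bulldog-tune",
        "description": "LoRA fine-tune of SDXL on bulldog images",
        "details": {
            "tune_type": "lora_tune",
            "base_checkpoint": {"name": "default-sdxl"},  # assumed identifier
            "trigger_words": ["sks1"],
            "steps": 800,
            "continue_on_rejection": False,
            "files": [
                {"file_id": "<asset-id-1>", "caption": "photo of sks1 bulldog at the beach"},
            ],
        },
    },
)
print(resp.json())  # includes the tune ID
```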
You’ll receive a response that includes the tune ID, which you can use to monitor the status of the job.
Monitor fine-tuning status
You can check the status of a fine-tune job by running a GET on the tune ID:
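A sketch of the status check, assuming the same base URL and tune endpoint as the create call above:

```python
# Poll the tune by ID and inspect its status field (assumed endpoint).
import os
import requests

tune_id = "<tune-id>"  # returned by the create call
resp = requests.get(
    f"https://api.octoai.cloud/v1/tune/{tune_id}",
    headers={"Authorization": f"Bearer {os.environ['OCTOAI_TOKEN']}"},
)
print(resp.json()["status"])  # e.g. pending, running, succeeded
```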
Generating images
When complete, the fine-tuned asset is stored in your Asset Library and available for image generation. You can generate images by including the LoRA in your image generation request.
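For example, a Text to Image request can reference the LoRA by its asset name along with a weight. The sketch below targets the SDXL generation endpoint; verify the payload fields (especially `loras`) against the Media Gen API Reference:

```python
# Sketch of an image generation request that applies the fine-tuned LoRA;
# the "loras" field maps the asset name to a weight (field name assumed).
import base64
import os
import requests

resp = requests.post(
    "https://image.octoai.run/generate/sdxl",
    headers={"Authorization": f"Bearer {os.environ['OCTOAI_TOKEN']}"},
    json={
        "prompt": "photo of sks1 bulldog wearing a red scarf, studio lighting",
        "loras": {"sks1-bulldog-tune": 0.8},  # LoRA asset name -> weight
        "num_images": 1,
    },
)
image_b64 = resp.json()["images"][0]["image_b64"]
with open("result.png", "wb") as f:
    f.write(base64.b64decode(image_b64))
```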
General fine-tuning tips
Image captions
Image captions can improve quality by providing additional context and details to the trained model. Whenever possible, we suggest using image captions - and be sure to include your trigger word in each caption.
You can also include the subject class - such as person, animal, or object - to improve quality. As an example, you could fine-tune images of a specific bulldog with a caption like `sks1 bulldog playing at the beach`, where `sks1` is the trigger word and `bulldog` is the subject class.
Image variation
We recommend some amount of variation in your images. If every image is a close-up, the fine-tuned model may be limited to representing that distance. It’s also helpful to have some level of consistency among the images so the model learns the intended subject. Finding the right balance between consistency and variation can take a few iterations, and we encourage you to experiment!
Managing fine-tunes and assets
You can manage a fine-tune job and the LoRA it creates separately: deleting a fine-tune job won’t automatically delete the training images or the LoRA created during fine-tuning, and vice versa. Additionally, fine-tune names must be unique; you may encounter an error if you try to create a fine-tune with a duplicate name.
Fine-tuning duration
Fine-tuning Stable Diffusion v1.5 usually takes about 3-7 minutes, and Stable Diffusion XL takes about 10-20 minutes. Increasing the number of training steps will extend the fine-tuning duration.