OctoAI Image Gen Solution introduces Photo Merge, which lets you integrate a photo’s subject into high-quality AI-generated output. It removes the need for custom facial fine-tunes, which require numerous tuning images and the 15-30 minutes of training typically associated with SDXL LoRAs. Photo Merge requires only 1-4 images and delivers precise results within a few seconds. Businesses can now apply GenAI-powered imagery to needs ranging from realistic CGI characters to personalized product recommendations to digital avatars.
Photo Merge is accessed through the "transfer_images" parameter of OctoAI’s Image Generation API. This parameter accepts a key-value pair consisting of a trigger word and an array of up to 4 images. It works only with SDXL models, and it can be combined with style presets, ControlNets, checkpoints, and LoRAs.
In this post, we will walk you through how to use the Photo Merge functionality within OctoAI Image Gen API, utilizing the new transfer_images parameter. We also show how it compares to using custom fine-tuning.
Solution overview
For our use case, we are assuming the role of a Retail Media Marketing & eCommerce Advertising Platform. With access to a real model's photo, our goal is to seamlessly integrate it into a variety of products. First, we'll generate AI-powered images of our human model. For this, we will try both the traditional approach of creating a custom fine-tune with the human model’s images and the new approach of using Photo Merge functionality. We will compare the results and latency of the two approaches. Next, we will create a custom fine-tune for the products our model will represent and lastly, we will showcase the seamless integration of the model’s face with the corresponding product.
Workflow steps
1. Create a custom facial fine tune of the human model with 10-12 portrait images.
2. Use Photo Merge (the transfer_images parameter in OctoAI's SDXL Image Gen API) with only 1-4 images of the human model, instead of the custom facial fine tune created in step 1.
3. Compare the results of the two approaches.
4. Create custom fine tunes (SDXL LoRAs) for a retail product.
5. Integrate the human model images (generated in step 2) with the images of the retail product (generated in step 4).
Prerequisites
For this walkthrough, make sure you have generated an OctoAI API token and set it in your environment. You may call OctoAI’s Image Generation API from any of our supported interfaces: the Python SDK, the TypeScript SDK, the CLI, or curl. Refer to our API documentation for details.
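As a minimal sketch of this prerequisite, the helper below reads the token from the environment and builds request headers using only the Python standard library. The variable name `OCTOAI_TOKEN` and the bearer-token header scheme are assumptions made for illustration; confirm both against the API documentation.

```python
import os


def auth_headers(token=None):
    """Build HTTP headers for a request to the Image Generation API.

    OCTOAI_TOKEN is an assumed environment variable name, and the
    Bearer scheme is an assumption as well.
    """
    token = token or os.environ.get("OCTOAI_TOKEN")
    if not token:
        raise RuntimeError("No API token found; export OCTOAI_TOKEN (assumed name)")
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
```

Failing early when the token is missing gives a clearer error than a rejected request later in the walkthrough.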
Walkthrough
Create a custom fine tune (SDXL LoRA) of a human model: We have taken 10 images of OctoAI’s CEO, Luis Ceze, as our tuning image dataset.
Next, we will create a custom fine tune from OctoAI’s WebUI. Navigate to Image Generation → Tuning & Datasets.
Click the ‘+New Tune’ button to begin. Adjust fine-tuning settings and upload your images. This involves selecting a base checkpoint (in our case, default SDXL), assigning a trigger word (to customize images with your subject), and specifying the number of steps. A range of 400 to 1,200 steps generally yields optimal results. Upload the image dataset and submit the fine tuning job.
With 800 steps and 10 tuning images, the job takes approximately 20-30 minutes to complete. Once it completes, let’s evaluate the effectiveness of our custom facial fine-tune.
Navigate to Image Generation → Image Tools and click on Text to Image tile card. Here, let’s use the following parameters:
The output images don't closely resemble our tuning dataset's human model, Luis Ceze. The man in the output shares some features with the tuning images, but he is not recognizably Luis.
Achieving a closer resemblance would require a larger tuning dataset (64-100 images) and/or increasing the number of steps, which would be both time and cost-intensive and not scalable.
Utilize the Photo Merge feature: Let’s now try the new Photo Merge feature and compare the output of the two approaches. We’ll use the transfer_images parameter in OctoAI’s Image Gen API to showcase this functionality.
Let us start with uploading 4 images of our human model — Luis Ceze.
Next, we pass the transfer_images={"triggerword": list of images} parameter within the payload of OctoAI’s SDXL Image Gen API at https://image.octoai.run/generate/sdxl.
In the given example, we employ the trigger word ‘luis’ and link it with the dataset comprising the four images mentioned earlier. Subsequently, we structure the prompt to incorporate the trigger word.
Prompt: A man luis sitting in a coffee shop.
The remaining parameters remain consistent with approach 1. It's worth noting that in this instance, no LoRA is utilized. Additionally, we utilize a checkpoint named ‘RealVisXL’, an OctoAI asset checkpoint specifically optimized for the Photo Merge feature. However, it's important to mention that the Photo Merge feature is functional even if the base SDXL checkpoint is utilized.
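The request described above can be sketched as follows. This is an illustration under stated assumptions, not the exact payload used in this post: images are assumed to be sent as base64 strings, the `checkpoint` field name and its value format are assumptions, and the helper names (`encode_image`, `build_photo_merge_payload`, `generate`) are hypothetical. Only the endpoint URL and the `transfer_images` shape (trigger word mapped to a list of up to 4 images) come from the text.

```python
import base64
import json
import urllib.request
from pathlib import Path

SDXL_URL = "https://image.octoai.run/generate/sdxl"  # endpoint named in this post


def encode_image(path):
    """Read an image file and return it as a base64 string (assumed wire format)."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")


def build_photo_merge_payload(prompt, trigger_word, encoded_images, checkpoint=None):
    """Assemble a request body that maps one trigger word to 1-4 images."""
    if not 1 <= len(encoded_images) <= 4:
        raise ValueError("transfer_images takes between 1 and 4 images")
    if trigger_word not in prompt.split():
        raise ValueError("the prompt must contain the trigger word")
    payload = {
        "prompt": prompt,
        "transfer_images": {trigger_word: list(encoded_images)},
    }
    if checkpoint is not None:
        payload["checkpoint"] = checkpoint  # field name and format are assumptions
    return payload


def generate(payload, token):
    """POST the payload and return the parsed JSON response, uninterpreted."""
    request = urllib.request.Request(
        SDXL_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # Bearer scheme is an assumption
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A call would look like `build_photo_merge_payload("A man luis sitting in a coffee shop.", "luis", [encode_image(p) for p in paths], checkpoint="RealVisXL")`, where `paths` is a hypothetical list of the four local image files. The response is returned as-is because its field names are not described in this post.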
The request takes approximately 8.8 seconds and generates the following output:
Pretty accurate, isn’t it? Let’s try a few different prompts and combine Photo Merge with other style presets, LoRAs, and checkpoints to confirm that we consistently get accurate results.
Let us use the transfer_images parameter in conjunction with the ‘Graffiti’ style preset. All other parameter values stay the same as in the payload above.
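One way to express "same payload, plus a style preset" is to copy the base request and add a single field. The sketch below assumes the field is called `style_preset`; that name is not given in this post, so treat it as hypothetical. The image entries are placeholders.

```python
import copy


def with_style_preset(payload, preset):
    """Return a copy of the payload with a style preset added.

    'style_preset' is an assumed field name; check the API reference.
    """
    updated = copy.deepcopy(payload)
    updated["style_preset"] = preset
    return updated


# Base Photo Merge request; the image strings are placeholders.
base = {
    "prompt": "A man luis sitting in a coffee shop.",
    "transfer_images": {"luis": ["<base64 image 1>", "<base64 image 2>"]},
}
graffiti = with_style_preset(base, "Graffiti")
```

Copying instead of mutating keeps the base payload reusable for the other variations in this walkthrough.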
The request takes approximately 8.7 seconds and generates the following output:
Let’s now use the transfer_images parameter with a pre-trained style LoRA. We have already imported a pre-trained style-based LoRA into OctoAI’s Asset Library. In the payload below, we reference that asset by its asset ID and assign it a weight of 1.0.
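A hedged sketch of attaching a LoRA to a Photo Merge request follows. It assumes LoRAs are passed as a `loras` mapping from asset ID to weight; both the field name and that shape are assumptions, and the asset ID shown is a placeholder rather than a real asset.

```python
def with_lora(payload, asset_id, weight=1.0):
    """Return a copy of the payload with one LoRA added.

    The 'loras' field and its {asset_id: weight} shape are assumptions.
    """
    if weight < 0:
        raise ValueError("LoRA weight must be non-negative")
    updated = dict(payload)
    loras = dict(payload.get("loras", {}))  # copy so the input is not mutated
    loras[asset_id] = weight
    updated["loras"] = loras
    return updated


base = {
    "prompt": "A man luis sitting in a coffee shop.",
    "transfer_images": {"luis": ["<base64 image 1>"]},
}
styled = with_lora(base, "<your-style-lora-asset-id>", weight=1.0)
```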
The request takes approximately 16.9 seconds and generates the following output:
You'll notice that the AI-generated images of our human model closely resemble his actual photos. The results of Photo Merge are significantly more precise, and they do not require the additional 20-30 minutes of fine-tuning a custom LoRA.
Comparison between custom fine tunes for faces (SDXL LoRAs) vs OctoAI’s Photo Merge
Approaches | Tuning image dataset | Steps in fine-tune | Time for fine-tune | Inference latency | Results |
---|---|---|---|---|---|
Custom Fine Tune for Face (SDXL LoRAs) | 16-64 | 800-1,000 | 20-30 minutes, increasing linearly with more tuning data and num of steps | Few seconds | Poor to mediocre quality |
Photo Merge | 1-4 | N/A | N/A | Few seconds | Precise and accurate |
Now that we've determined the best approach for generating accurate images of our human model, let's bring it all together. We'll create a custom fine-tune for our retail product and seamlessly integrate our AI-generated human model's image with it.
Create custom fine tunes (SDXL LoRAs) for retail product: The steps to create a custom fine tune are similar to what was showcased earlier in the blog. We will upload 10-12 images of our product, which in our case are different colored Lacoste polo shirts for men.
We will then create a custom fine tune by configuring the appropriate fine-tuning parameters (as shown earlier), assigning it a different trigger word, and uploading our tuning dataset.
After approximately 20-30 mins, our custom LoRA fine tuned on our branded polo-shirts will be available.
We are now ready to bring everything together. Let us use the transfer_images parameter (Photo Merge) to generate accurate images of our human model, Luis, and apply the ‘lacosteshirt-finetune’ LoRA to the shirt he is wearing.
Please note that "luis" serves as the trigger word associated with Luis’s images in transfer_images. We position this trigger word immediately after the subject, "man," enabling our human model to inherit the facial attributes of Luis’s images. Additionally, we input the asset ID of the custom LoRA tuned for Lacoste polo shirts for men, which is associated with the trigger word "lacosteshirt1." This trigger word is placed immediately after the word "T-shirt" in our prompt, ensuring that the required attributes are applied to the shirt.
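The trigger-word placement described above can be sketched as a small prompt helper: each trigger word is inserted immediately after the word it should modify. The helper names, the base prompt, and the `loras` field shape are assumptions for illustration; the trigger words "luis" and "lacosteshirt1" come from the text, and the asset ID is a placeholder.

```python
import re


def place_trigger(prompt, anchor, trigger):
    """Insert `trigger` right after the first whole-word match of `anchor`."""
    match = re.search(rf"\b{re.escape(anchor)}\b", prompt, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"{anchor!r} not found in prompt")
    end = match.end()
    return f"{prompt[:end]} {trigger}{prompt[end:]}"


def build_combined_payload(base_prompt, face_images, lora_asset_id, lora_weight=1.0):
    """Combine Photo Merge (face) with a product LoRA (shirt) in one request."""
    prompt = place_trigger(base_prompt, "man", "luis")            # face trigger
    prompt = place_trigger(prompt, "T-shirt", "lacosteshirt1")    # product trigger
    return {
        "prompt": prompt,
        "transfer_images": {"luis": list(face_images)},
        "loras": {lora_asset_id: lora_weight},  # field shape is an assumption
    }
```

For example, the base prompt "A man wearing a pink polo T-shirt in a coffee shop" becomes "A man luis wearing a pink polo T-shirt lacosteshirt1 in a coffee shop".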
The request takes seconds and generates the following output:
Voila! The generated output integrates our human model’s face (in this case, Luis's) with the corresponding product: a pink Lacoste polo T-shirt.
This blog showcases just one facet of OctoAI’s Photo Merge feature's possibilities. Photo Merge offers endless potential - whether in entertainment, gaming, marketing agencies, or fashion and retail sectors, it can help craft personalized avatars, advertisements, and brand ambassador representations. It can also enable virtual try-ons and lifelike digital product showcases. To learn more, refer to our documentation.
Get started using Photo Merge today
Sign up and try Photo Merge for free on the OctoAI Image Gen Solution today.
Please join us on Discord to engage with the team and our community. We’ll use the Discord channel to share upcoming features, promotions, and competitions. Stay tuned to learn more. I look forward to seeing the applications and imagery you build using OctoAI Image Gen Solution.