“In the rapidly evolving world of artificial intelligence, Large Language Models (LLMs) have emerged as a cornerstone of linguistic computational ability. However, the true prowess of these models lies not just in their inherent design, but in their ability to be fine-tuned for specific applications and tasks.”
If it wasn’t immediately apparent, a human didn’t write that. It’s AI. It’s the kind of introduction you’ll get if you ask ChatGPT to “write an introduction for an article titled ‘What You Need To Know About Fine-Tuning Your LLM Models.’”
That introduction is… fine. But we wouldn’t use it if we weren’t making a point. It’s not great. “Fine, not great,” is a description that could apply to almost any output from the vast generalist LLMs.
This is understandable. ChatGPT is pushing 200 million users. It has to answer every one of their queries, whether it’s how to write a business email, a TypeScript function, or a love letter to their beau. It is the silicon epitome of a jack of all trades, master of none.
But if you are trying to build an AI-first product on top of these models, you need it to be a master. You want high-quality results within a narrow domain and don’t need the broad knowledge and abilities general LLMs are built for.
This means building a better model tailored to your specific use case. You can do this from scratch, but that would be a terrible idea. The amount of data, time, and money you’d need to train and deploy a domain-specific LLM would be immense.
Instead, fine-tuning is the answer. Let’s go through what fine-tuning entails for LLMs and how to do it in practice to reduce costs and latencies while increasing the quality of your responses.
Fine-tuning an LLM for more fun and more profit
We’re going to build the world’s best recipe classifier. By the end of this, no one will be able to classify recipes as well as we can.
To do this, we will use OpenPipe to make the process easier. OpenPipe is a platform for managing fine-tuning across different models. We’ll use OpenPipe to ingest our training data, train our new model, evaluate its performance, and then run future inference.
Getting good data
Let’s start with the data. We will use a dataset from OpenPipe that consists of 5,000 unlabeled recipes. Working in Python, we’ll first install the datasets library from Hugging Face:
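A quick install from pip gets us the library (shown here on its own; you’ll also need the OpenAI and OpenPipe packages for the later steps):

```shell
pip install datasets
```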
Then we’ll open up a new file called fine-tuning.py and load the dataset:
This will show us what the data looks like:
Delicious.
Importantly, this isn’t the data we’ll use to fine-tune our model. It is the data we’re going to use to generate the AI responses that we’ll use to fine-tune our model.
We are building a recipe classifier. Basically, from every recipe, we want to know:
Does it require an oven?
Does it require a stove?
Does it take over 30 minutes to prepare?
Does it contain meat?
Is it a main dish?
We can see why we want these. If we’re building a recipe app where users can input recipes for other users, then being able to classify them along these axes quickly means we can introduce filtering and search in our app more easily.
Doing so manually would be laborious. Asking users to do it adds friction. It’s an ideal task for AI, but we need an AI that can cook. We’re going to pass all 5,000 recipes through GPT-4 and get it to output its classification for each of the recipes. Those classifications will then be used to fine-tune our model.
Here’s the code to do this:
If you are interested in fine-tuning an LLM, you are probably already well-versed in LLM API calls, so the code above will look familiar. But it does have a twist. Instead of using:
We are using:
This version of the client is a wrapper around the core OpenAI API that automatically logs every request so you can use it for fine-tuning later:
Here, we end up with 5,000 requests logged in OpenPipe. These are the inputs and outputs of the general LLM and are the data we’ll use to train our model.
Will a magical, structured, clean dataset exist for you to train your new model? Probably not. This leads to #1 of What You Need To Know About Fine-Tuning Your LLM Models.
#1. Save Your Data. In reality, you are going to be training with your current data. If you use the OpenPipe OpenAI wrapper, every prompt and response will be logged automatically. If not, find a way to keep this data so that you can use it for fine-tuning later. Relatedly, consider going through this ever-growing dataset and picking out the best responses you find, so that you are slowly building the high-quality dataset your fine-tuning process needs.
To give you an idea of timing and cost, it took about three and a half hours to run through these 5,000 requests to the GPT-4 model and cost just a little over $60.
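Back-of-the-envelope, those figures work out to roughly two and a half seconds and about a cent per recipe:

```python
requests = 5000
total_seconds = 3.5 * 60 * 60  # three and a half hours
total_cost = 60.0              # dollars, approximate

print(total_seconds / requests)  # seconds per request (~2.5)
print(total_cost / requests)     # dollars per recipe (~$0.012)
```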
The next step within OpenPipe is to create a dataset using our requests:
Here, we’re using all 5,000. But, again, reality is going to come back to bite you. Not every data point is going to be suitable for training. We aren’t expecting the general LLMs to do well, so you’ll likely have to prune your dataset to remove poor responses.
Training our next top model
Once we have our dataset, we can run the fine-tune. OpenPipe will automatically perform an 80:20 split on your dataset for train and test:
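On our 5,000 logged requests, that 80:20 split works out to:

```python
n = 5000
train_size = int(n * 0.8)   # examples used for training
test_size = n - train_size  # examples held out for evaluation
print(train_size, test_size)
```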
Here, OpenPipe helpfully suggests our training set is “Considerable.”
Data will be an essential factor in your fine-tuning. The more nuance your model needs, the more data you’ll need to capture that nuance. The good news is that you can start experimenting with fine-tuning even with small datasets and continue molding your model as your dataset grows.
Here, we’re choosing to train a GPT 3.5 Turbo base model. This is one of the main benefits of fine-tuning and #2 of What You Need To Know About Fine-Tuning Your LLM Models.
#2. Cost savings will be substantial. We can make smaller models perform better than larger models. We can take advantage of the lower costs and latencies of GPT 3.5 but still have high-quality responses. This is even more true for open-source models. The cost to run a Llama2-70b model on OctoAI is $0.0006 / 1K tokens for inputs and $0.0019 / 1K tokens for output. Above, we were running our recipes through GPT-4. The costs with OpenAI were $0.03 / 1K tokens for inputs and $0.06 / 1K tokens for output. So, there is an almost 50X difference in input pricing and a 30X difference in output pricing. Given our rough 7.5:1 input:output ratio in the above training, running all 5,000 recipes through a fine-tuned Llama2-70b would cost about $4 instead of $60.
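The pricing ratios in #2 check out, using the per-token prices quoted above:

```python
gpt4_input, gpt4_output = 0.03, 0.06        # $ per 1K tokens (OpenAI, GPT-4)
llama_input, llama_output = 0.0006, 0.0019  # $ per 1K tokens (Llama2-70b on OctoAI)

print(gpt4_input / llama_input)    # ~50x cheaper on inputs
print(gpt4_output / llama_output)  # ~31.6x cheaper on outputs
```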
Let’s hit “Start Training” and see what happens. There's not much within OpenPipe, but we’re running an OpenAI fine-tune behind the scenes.
Forty-five minutes and $24 later, we have our model. The above graph shows how the training went. The sharp downward slope in the training loss suggests that the model’s errors dropped reasonably quickly and converged to a relatively stable value. This means we really might not have needed all 5,000 responses.
This isn’t something you can easily know beforehand, leading to #3 of What You Need To Know About Fine-Tuning Your LLM Models.
#3: Fine-tune early and often. Don’t wait until you have a glut of data. You might only need double-digit training examples to start fine-tuning and triple-digit training examples to see response improvements. Try fine-tuning quickly and then iterate on your models.
But more data is generally going to mean more better.
Is our model better?
Let’s quickly run our fine-tuned model. Again, we can run the model using OpenPipe, and doing so is as easy as replacing “gpt-4” in the OpenAI client with the name of your new model (the interestingly random “twelve-donkeys-kiss” here). Everything else stays the same:
The output is sound:
If we look at the request logs again, we can see two awesome things:
First, the response is quicker. About 3X quicker. This is even without #4 in What You Need To Know About Fine-Tuning Your LLM Models.
#4. You can now cut down your inputs. The primary approach people use to bend LLMs to their will has been prompt engineering. If we look at our recipe prompt, it's pretty involved, and our input:output token ratio is 7.5:1, almost double the average 4:1. Once you have a fine-tuned model, the next step should be to pare down your prompts to reduce latencies and costs.
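To illustrate the idea (both prompts here are illustrative): once the task lives in the model’s weights rather than the prompt, the instructions can shrink dramatically:

```python
# Before fine-tuning: the base model needs the full task spelled out.
long_prompt = (
    "Your goal is to classify a recipe along several dimensions. "
    "Answer each question with 'yes' or 'no', one per line:\n"
    "has_oven: Does it require an oven?\n"
    "has_stove: Does it require a stove?\n"
    "over_30_mins: Does it take over 30 minutes to prepare?\n"
    "has_meat: Does it contain meat?\n"
    "main_dish: Is it a main dish?"
)

# After fine-tuning: the model already knows the task.
short_prompt = "Classify this recipe."

# Fewer input tokens on every single request.
print(len(long_prompt.split()), "->", len(short_prompt.split()))
```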
Those costs are also 10X lower here because we are using a cheaper model.
Evaluating a fine-tuned model comes down to speed, cost, and quality. Speed and cost are nicely numeric, so you can easily judge whether your model is performing efficiently.
Quality is annoyingly qualitative. Here, the main option is to look at side-by-side comparisons. For our recipe classifier, this is easier as we can see whether the five classifications line up. OpenPipe gives you a nice table view to perform this efficiently:
There is also the option to run some evaluative AI on your dataset. You can set instructions for evaluating the fine-tuning and test datasets. Here, we asked, “Which model’s output better matches the prompt?” The actual dataset just squeaked it, but effectively, it was a tie.
But consider #5 in What You Need To Know About Fine-Tuning Your LLM Models.
#5. It’s all about trade-offs. It would be great if the fine-tuned model were the world-beating recipe classifier we want. It isn’t. Well, not yet. But it is on par with a much larger, much more expensive model. We’ve cut our costs 10X with no discernible drop in quality. If we wanted or needed a definite increase in quality, we could think about fine-tuning a better model, such as GPT-4. But then costs would increase again. So you must always think about what you want from your fine-tuning: lower costs, lower latencies, or higher quality?
I heard you like fine-tuning on your fine-tuning
The best thing about fine-tuning? It never ends. You fine-tune your base model, then you can fine-tune your fine-tune, then fine-tune the fine-tuning of your fine-tune.
The point is that if you have responses that help your end users, i.e., high-quality responses that answer their needs, you can add that data to your dataset and keep tweaking your model, letting it learn more and further narrowing its focus.
And, importantly, this isn’t a zero-sum choice. You can get lower costs, lower latencies, and higher quality with fine-tuning. We didn’t get that here, but it is eminently achievable.
This is even truer if you move to open-source models. Then costs can dramatically fall while giving you lower latencies through better prompt engineering and higher-quality responses through good base models and fine-tuning.
Custom fine-tuned LLM checkpoints on OctoAI are in private preview today, and you can get in touch if you are interested in deploying your own fine-tuned open-source LLMs. If you aren’t at that stage yet, you can get started at no cost with the built-in LLMs available on the OctoAI Text Gen Service.