Bigger is not always better
LLMs like GPT-4o have taken the world by storm, transforming enterprises with their multimodal capabilities and their ability to tackle highly complex tasks.
At the same time, smaller LLMs, which we’ll dub SLMs (small language models), have gained significant developer traction thanks to their small size, low cost, and open-source availability (e.g. Llama 3.1-8B by Meta). So much so that OpenAI released its own lower-cost SLM variant, GPT-4o-mini, on July 18th.
For many of you building GenAI-driven applications, particularly for the enterprise, this raises the question: is bigger always better?
The answer is: it really depends on how you use the language models. I’m going to argue that for certain use cases, smaller language models with the right prompt engineering and fine-tuning can give you a better price-to-performance tradeoff. We’ll demonstrate techniques that can help you get the most out of your smaller language models, namely:
Prompt engineering techniques
Parameter-efficient fine-tuning
With these tools under your belt, you’ll learn how to get an open-source SLM, Llama 3.1-8B, to outperform a proprietary LLM, GPT-4o, on both cost and quality.
A use-case-driven quality study: PII redaction
For our evaluation, we’ll use the task of Personally Identifiable Information (PII) redaction. PII redaction masks or removes sensitive content, in the form of PII, from documents, emails, and transcripts. It’s a widespread task performed on enterprise data, particularly in regulated industries. Language models are a great fit for PII redaction because they can easily adapt to new categories of PII that traditional PII tools might not support out of the box. Yet it’s an interesting use case for us, since some of the harder PII redaction scenarios can pose a challenge even for top-of-the-line LLMs like GPT-4o.
The language model performs PII redaction via tool calling, as shown in the diagram above. Tool calling, also referred to as function calling, allows language models to invoke external tools (e.g. query a database, perform a web search, call an arbitrary function). If a language model is the brain of a system, tools are its limbs.
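To make this concrete, here’s a minimal sketch of what such a tool-calling request can look like against an OpenAI-compatible chat completions API. The `redact_pii` tool name and its argument schema are illustrative assumptions, not the exact definitions used in this study:

```python
from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint works the same way

# Illustrative tool definition: the model extracts PII entities and hands
# them to this tool, which performs the actual masking.
tools = [{
    "type": "function",
    "function": {
        "name": "redact_pii",  # hypothetical tool name
        "description": "Redact the given PII entities from the input text.",
        "parameters": {
            "type": "object",
            "properties": {
                "entities": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "value": {"type": "string"},
                            "pii_class": {"type": "string"},
                        },
                        "required": ["value", "pii_class"],
                    },
                },
            },
            "required": ["entities"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a PII redaction assistant. Use the redact_pii tool."},
        {"role": "user", "content": "Hi, I'm Jane Doe, reach me at jane@example.com."},
    ],
    tools=tools,
)
print(response.choices[0].message.tool_calls)  # the PII entities the model wants to redact
```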
We’re using a labeled dataset from AI4Privacy, available on HuggingFace. The dataset contains 54 PII classes (e.g. name, email, credit card number) and covers five interaction styles (e.g. casual conversation, formal document, email).
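If you want to explore the data yourself, it can be loaded with the HuggingFace datasets library. The dataset ID below is our assumption of the AI4Privacy release used here:

```python
from datasets import load_dataset

# Assumed dataset ID for the AI4Privacy PII masking dataset on HuggingFace.
ds = load_dataset("ai4privacy/pii-masking-200k", split="train")
print(ds[0])  # a sample with source text, gold masked text, and labeled PII spans
```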
Round 1: Baseline evaluation of Llama 3.1-8B vs. GPT-4o
Before we dive into the specific optimization techniques aimed at improving the quality of Llama 3.1-8B, let’s explore how it stacks up against GPT-4o for this PII redaction task in terms of cost vs. accuracy.
For this baseline evaluation we use a well-written system prompt that describes the task the model needs to perform. This prompting technique is known as zero-shot prompting and is quite widespread in many initial LLM PoCs.
The results of this evaluation are listed in the leaderboard chart below. On accuracy, GPT-4o is the clear winner at 76.7%, while Llama 3.1-8B wins on cost effectiveness by a long shot (17x lower cost).
While one could argue that both language models are Pareto optimal in terms of accuracy vs. cost, in our experience what really determines whether a GenAI PoC makes it to production is quality. GPT-4o wins this round.
Round 1 Winner: GPT-4o, strictly based on higher quality results.
Round 2: Advanced prompt engineering
Many teams might stop there and use the results above to choose the higher quality model — despite the 17x higher inference cost.
However, if you are a more savvy prompt engineer, you’ll know to evaluate few-shot prompting as a technique (among many others) to improve the overall quality of your results.
How does it work? The core idea is to complement the task description by providing the language model with examples of how to handle the task correctly. While many few-shot prompting tutorials place the examples in the system prompt, we found this approach to have limited success on our function calling example. Instead, we explore an approach that performs better for our use case: we build the example set into the list of messages that are passed to the LLM, as shown next.
Let’s walk through the illustration below to see how the language model messages are constructed for this PII redaction function calling use case.
With zero-shot prompting, we invoke the language model by passing in a system prompt and a user prompt. This is the most basic way to invoke the model’s chat completions API.
With single-shot prompting, we insert into the messages a full conversation round trip that includes:
an example user prompt (example text to be redacted),
the agent’s first response (the language model determining which tool to call and what arguments to pass, i.e. the PII to redact),
the tool response (containing the correctly redacted text, scrubbed of its PII),
and the agent’s second response (language model providing the correctly redacted text back to the user).
With few-shot prompting, we insert more than one full conversation round trip to supply additional usage examples, as sketched in the code below.
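Here’s a minimal sketch of how that message list can be assembled, reusing the illustrative `redact_pii` tool from earlier; the example texts, tool-call ID, and PII class names are placeholders:

```python
import json

system_prompt = "You are a PII redaction assistant. Use the redact_pii tool."
text_to_redact = "Hi, I'm Jane Doe, reach me at jane@example.com."

# One full example round trip:
# user -> assistant (tool call) -> tool (redacted text) -> assistant (final answer)
example_round_trip = [
    {"role": "user", "content": "My name is John Smith, call me at 555-0100."},
    {"role": "assistant", "content": None, "tool_calls": [{
        "id": "call_example_1",
        "type": "function",
        "function": {
            "name": "redact_pii",
            "arguments": json.dumps({"entities": [
                {"value": "John Smith", "pii_class": "NAME"},
                {"value": "555-0100", "pii_class": "PHONE_NUMBER"},
            ]}),
        },
    }]},
    {"role": "tool", "tool_call_id": "call_example_1",
     "content": "My name is [NAME], call me at [PHONE_NUMBER]."},
    {"role": "assistant",
     "content": "My name is [NAME], call me at [PHONE_NUMBER]."},
]

messages = [{"role": "system", "content": system_prompt}]
messages += example_round_trip  # single-shot: one example round trip
# few-shot: append additional round trips here before the real request
messages.append({"role": "user", "content": text_to_redact})
```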
This technique delivers strong results:
We see a significant improvement in quality for all models when single-shot and few-shot prompting are in use. We also see a slight cost increase as a result. That’s to be expected, since few-shot prompting increases the number of prompt tokens we feed into the language models.
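You can estimate that overhead directly by counting tokens. Here’s a quick sketch with tiktoken for the GPT-4o side (assuming a recent tiktoken version that knows about GPT-4o; the prompt strings are placeholders, and Llama models use a different tokenizer):

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

zero_shot_prompt = "You are a PII redaction assistant..."                # system prompt only
few_shot_prompt = zero_shot_prompt + "<serialized example round trips>"  # plus examples

print(len(enc.encode(zero_shot_prompt)))
print(len(enc.encode(few_shot_prompt)))  # every example adds prompt tokens to every call
```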
An interesting takeaway from this evaluation is that an engineer who knows how to perform more sophisticated prompt engineering techniques (e.g. few-shot prompting) can get better quality results with Llama 3.1-8B vs. an engineer who uses basic prompt engineering (e.g. zero-shot prompting) on GPT-4o.
Round 2 Winner: GPT-4o with few-shot prompting wins on accuracy. Interestingly, Llama 3.1-8B with few-shot prompting is shown to outperform GPT-4o with zero-shot prompting, speaking to the effectiveness of prompt engineering techniques.
Round 3: Parameter-efficient fine-tuning
At this point, most teams will feel satisfied with the work they’ve put into prompt engineering: accuracy has increased a good deal without causing costs to balloon. With GPT-4o and few-shot prompting, you can achieve 88.4% accuracy on this dataset, a solid improvement over where we started. But what if production requires a higher quality bar, e.g. 95%? At that point you may need to explore techniques that go beyond prompt engineering.
This is where fine-tuning comes in. Fine-tuning is a powerful technique that updates your language model’s weights so that it excels at a specific task. Specifically, we’ll rely on parameter-efficient fine-tuning (PEFT), which produces a LoRA (as opposed to a full model checkpoint). A LoRA is a very compact representation of a model fine-tune: it occupies tens of MBs (as opposed to GBs for checkpoints).
LoRAs are particularly useful when you want to deploy your fine-tune cost-effectively. Unlike model checkpoints, which require a dedicated inference endpoint to serve the fine-tune, LoRAs can be served on a shared-tenancy endpoint, drastically reducing the cost of inference for end users.
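The fine-tune in this study was run through a managed service (more on that in the Methodology section), but for intuition, here’s a minimal sketch of a LoRA setup with the HuggingFace peft library; the model ID and hyperparameters are illustrative defaults, not the ones used in this study:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# LoRA trains small low-rank adapter matrices instead of the full weights,
# which is why the resulting artifact is tens of MBs rather than GBs.
lora_config = LoraConfig(
    r=16,                                 # rank of the adapter matrices
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base model
```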
In general, you should consider fine-tuning your language model when the guidelines we’ve summarized below apply:
You’ve already attempted fairly sophisticated prompt engineering techniques and have hit a wall in getting the accuracy to improve further.
You plan on using your language model for a fairly specialized use case (e.g. classification, summarization, entity recognition, sentiment analysis etc.).
You have plenty of your own high-quality data to work with, ideally as close as possible to the data your language model will see in production.
You have an evaluation methodology to assess your language model’s accuracy, so you can track improvements and regressions in quality.
Methodology
We use the OpenPipe fine-tuning service to fine-tune a Llama 3.1-8B model on a dataset of 10k samples, split into 90% training and 10% validation sets. We deploy the LoRA on our OctoAI inference service, which offers a great cost structure for fine-tuned inference: performing inference on a fine-tune doesn’t cost more than on the base model, making fine-tuning a very attractive optimization for production workloads.
Finally, we evaluate the fine-tuned Llama 3.1-8B on a holdout set that is not part of the fine-tuning dataset.
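As a sketch of what that looks like, here’s the split and scoring logic we have in mind; the dataset ID, field names, and exact-match metric are our assumptions for illustration:

```python
from datasets import load_dataset

ds = load_dataset("ai4privacy/pii-masking-200k", split="train")  # assumed dataset ID
splits = ds.train_test_split(test_size=0.1, seed=42)             # 90/10 train/validation
train_set, validation_set = splits["train"], splits["test"]

def holdout_accuracy(redact_fn, holdout):
    """Exact-match accuracy: model output vs. the dataset's gold masked text."""
    hits = sum(
        redact_fn(row["source_text"]) == row["masked_text"]  # field names assumed
        for row in holdout
    )
    return hits / len(holdout)
```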
For the sake of fairness, we wanted to compare against GPT-4o fine-tunes. Unfortunately, the PII redaction dataset we attempted to train the model on was flagged by OpenAI’s moderation system, and it’s unclear why a dataset derived from AI4Privacy would violate OpenAI’s usage policies.
The OpenAI policy error message reads: “The job failed due to an invalid training file. This training file was blocked by our moderation system because it contains too many examples that violate OpenAI's usage policies, or because it attempts to create model outputs that violate OpenAI's usage policies.”
Consequently, we were unable to perform a complete comparison of Llama 3.1-8B fine-tuning vs. OpenAI GPT-4o fine-tuning (which had just launched on Aug 20th 2024).
Let’s look at the results from fine-tuning.
The fine-tuned Llama 3.1-8B model outperforms GPT-4o with few-shot prompting on accuracy by 7.6 percentage points. In other words, we’ve reduced the error rate from 11.6% down to 4%, a 65.5% reduction in errors, which is a critical improvement in a production setting.
In addition, we’ve managed to reduce costs by 32x compared to the best GPT-4o solution, a staggering amount of savings when you consider that, at scale, a PII redaction system can get quite costly.
Assuming you process 1M entries a day using GPT-4o few-shot prompting, you’d be spending about $4,300 a day, or roughly $1.57M a year. With the more efficient Llama 3.1 fine-tune you’d be spending only $136 a day, or about $50k a year.
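The back-of-the-envelope math behind those figures:

```python
requests_per_day = 1_000_000
gpt4o_daily = 4300      # USD/day, GPT-4o with few-shot prompting
llama_ft_daily = 136    # USD/day, fine-tuned Llama 3.1-8B

print(gpt4o_daily / requests_per_day)       # ~$0.0043 per entry
print(gpt4o_daily * 365)                    # ~$1.57M per year
print(llama_ft_daily * 365)                 # ~$50k per year
print(round(gpt4o_daily / llama_ft_daily))  # ~32x cost reduction
```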
The fine-tuned Llama 3.1-8B model costs less than the vanilla zero-shot Llama 3.1-8B model for two reasons:
OctoAI doesn’t charge extra for serving LoRAs vs. base models.
The prompts going into the model are a lot simpler because the model is specifically trained for the task of PII redaction, saving us prompt tokens.
There is of course the cost of fine-tuning to take into account. It costs about $20 to fine-tune Llama 3.1-8B on a 10k-sample PII redaction training set. This cost wasn’t included in our inference cost analysis since it’s a one-time cost. At roughly $0.0043 per GPT-4o inference, it would take fewer than 5,000 GPT-4o inferences to recoup the cost of this fine-tuning job, so in a production setting the cost of fine-tuning is insignificant.
Round 3 Winner: Llama 3.1-8B handily wins this round, coming out ahead on both accuracy and cost.
Conclusion
As an AI engineer, all else being equal, you will always get better-quality results using a state-of-the-art LLM like GPT-4o than an SLM like Llama 3.1-8B.
That being said, if you have a clear quality target, there are very effective techniques you can use to close the quality gap between LLMs and SLMs, ultimately unlocking order-of-magnitude cost savings for your business.
We looked at a common task done in the enterprise – PII redaction – and saw that even this seemingly simple task can pose a challenge for state-of-the-art LLMs like OpenAI’s GPT-4o.
We summarize the findings of our quality optimization study in the table below in a qualitative way (using the very scientific method of emoji-based scoring).
It’s clear that parameter-efficient fine-tuning delivers the best results on Llama 3.1-8B in terms of quality and cost. In terms of inference speed, running Llama 3.1-8B fine-tunes takes 2.8s on average vs. 2.5s for multi-shot prompting, due to LoRA batching overheads. Fine-tuning also comes at an engineering cost: you’ll have to build a dataset, define a scoring metric, and run fine-tuning and quality evaluations. We talked to our developers and customers: while all understand the value of fine-tuning, many have been intimidated by how much heavy lifting is needed to achieve these results.
Thankfully, we have a notebook that walks you through how to reproduce the fine-tuning results from this study, which should help demystify the process of fine-tuning for you and your team: How_to_supercharge_Small_Language_Models.ipynb
You can also take a look at the data from this evaluation on Weights & Biases dashboards we’ve curated for you:
Dashboard 1: Comparing Llama3.1 Fine Tune vs. Llama3.1
Dashboard 2: Comparing Llama3.1 Fine Tune vs. GPT-4o
We presented an earlier version of this evaluation at a webinar on LLM cost and quality optimization: How to Optimize LLMs for Cost and Quality.