In recent years, large language models (LLMs) have captivated our imaginations with their seemingly magical capabilities. However, as the novelty of chat-with-your-PDF-style apps begins to wane, organizations are now focused on the practical question: “How do we turn this technology into a profitable asset?” Or perhaps more realistically, “How do we utilize this tech without incurring exorbitant costs?” (For answers to the latter, consider exploring OctoStack). In this blog post, however, we'll address a different challenge: seamlessly integrating GenAI into broader organizational processes.
One of the key hurdles in extracting ROI from LLMs is embedding them into existing workflows. LLMs have the potential to automate and enhance tasks traditionally performed by humans or specialized, often fragile software. Yet, the path to effective integration is fraught with difficulties. LLMs sometimes struggle to follow instructions precisely, and parsing their outputs into usable formats can be a significant challenge.
This blog offers a solution to this problem: structured outputs. Utilizing JSON mode on OctoAI with models like Mistral 7B or Llama 3 provides an easy, cost-efficient method to integrate GenAI. We'll demonstrate the immense value of this feature through a practical example—converting free-form e-commerce feedback into Jira tickets. This functionality can save hours of labor for Product Managers and specialists, enabling them to focus on higher-impact projects.
LLM capabilities beyond chatbots
The gap between what LLMs can do and how they are currently used, along with the hype surrounding GenAI, obscures an untapped opportunity: automating mundane, often internal-facing tasks. LLMs are an automation powerhouse. Although LLMs have limitations that make them unsuitable for business-critical tasks at present, not all tasks fall into that category. For instance, using an LLM to automate decision-making in diagnosing patients or determining a marketing budget would be unwise. However, these same LLMs can be used to automate form parsing, summarize reports, or enrich customer support tickets with information from internal documents in an easily digestible format.
These models excel at data mining, knowledge extraction, reformatting, and data enrichment. Take customer support tickets, for example: imagine transforming free-form user complaints into structured data, adding context, extracting necessary details, and saving it in your internal issue tracking system. Tasks that once required years and numerous models and specialized parsers can now be accomplished in weeks using a single LLM.
I’m painting a rosy picture here, talking about “infinite possibilities”, but reality falls short of it. So, what’s stopping us from realizing this vision?
The roadblocks ahead
Many organizations and software developers already know that operationalizing and productizing LLMs is challenging. High costs, subpar latency, limited interpretability, data quality issues, hallucinations, and models being jailbroken are some of the toughest problems when moving LLMs from demos to production. But there’s another issue that we hear is particularly frustrating: the unreliable nature of LLM outputs.
For example, try prompting an LLM to give you an answer in a very specific format, whether it's SQL or a plain list with exactly three bullet points and a prefix. Run this prompt a dozen times, and you'll notice that the outputs occasionally don’t follow the schema. At best, they add or omit formatting. At worst, they include additional “rationale” without being asked. You might think, “So what?” Well, this inconsistency is a dealbreaker for any system looking to integrate an LLM, unless you’re using the LLM output as the final system output, chatbot-style.
Integration is the crucial, often overlooked aspect of any software system. Few software systems can cover an entire vertical with minimal integration needs because real-world applications require interaction. No software is an island, and Generative AI models are software too. To unlock their full potential, we need a reliable way to integrate LLMs and GenAI models into our existing processes. However, as mentioned, this is no easy task because LLMs often struggle to follow instructions consistently.
Warning: technical details ahead! All the most popular LLMs are autoregressive language models, including the OpenAI GPT family, Meta Llama, Mistral, etc. Let me explain. First, the language model part: these models try to learn how language works and generate text based on what they learned. What about autoregressive? The autoregressive part refers to how the text is generated, namely by taking as input all the provided text, outputting the next single token (roughly, a word or a piece of a word), and then passing in all the provided text plus the newly generated text again to generate the next token. And so on until it’s done. Why should you care? This means the decision about what new token to generate depends on all the preceding tokens. Couple this with the complexity of even the smaller LLMs, and you get yourself a system where slight variations in the inputs may entirely change the outputs, and the longer the text sequences you generate, the more likely the model is to slip and generate so-called hallucinations. And I’m not even talking about the various sampling strategies that exacerbate the problem and make the output non-deterministic. So yeah, it’s hard to treat LLMs as reliable software components.
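To make the loop concrete, here is a minimal sketch of autoregressive generation. Everything in it is a toy assumption: `TOY_MODEL` is a hand-written lookup table standing in for a real LLM, and it only looks at the last token, whereas a real model conditions on the whole sequence. The shape of the loop (predict one token, append it, feed everything back in) is the part that carries over.

```python
import random

# Hypothetical toy "model": maps the last token to weighted next-token candidates.
TOY_MODEL = {
    "<start>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"ran": 0.8, "sat": 0.2},
    "sat": {"<end>": 1.0},
    "ran": {"<end>": 1.0},
}


def next_token_distribution(tokens):
    # A real LLM would use all of `tokens`; this toy only uses the last one.
    return TOY_MODEL[tokens[-1]]


def generate(prompt_tokens, greedy=True, seed=None, max_new_tokens=10):
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        if greedy:
            # Deterministic: always take the most likely token.
            token = max(dist, key=dist.get)
        else:
            # Sampling: the same prompt can now produce different outputs.
            token = rng.choices(list(dist), weights=list(dist.values()))[0]
        if token == "<end>":
            break
        tokens.append(token)  # the new token becomes part of the next input
    return tokens
```

Changing one early token (say, "dog" instead of "cat") changes the distribution for everything after it, which is the sensitivity described above in miniature.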
Are we doomed then?
The solutions
There are two general ways to approach eliciting structured output from these autoregressive language models. Let’s start by breaking down the most common method invoked in chat interfaces: few-shot prompts with examples of the target output.
A classic solution to make an LLM more likely to follow the desired schema relies on multiple tricks like:
A few-shot prompt showing examples of free-form text inputs and the structured outputs the model should produce for them.
If the schema the LLM should follow is not traditional JSON or YAML, special parsing functions will also be required.
Sometimes, it also helps to “start” the LLM’s response with the correct beginning of the schema. This technique is called priming.
Finally, if there’s a parsing or a validation error, you should retry the text generation function call/request. You could also pass the validation error message to help the model regenerate correctly.
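Put together, the tricks above look roughly like the sketch below. It is an illustration under assumptions, not the code from the demo: `complete` is a hypothetical callable that sends a prompt to whatever LLM you use and returns the raw completion, and the two-key schema (`category`, `sentiment`) is invented for the example.

```python
import json

REQUIRED_KEYS = {"category", "sentiment"}  # hypothetical target schema
PRIMER = "{"  # priming: we write the opening brace on the model's behalf

FEW_SHOT = (
    "Review: The magazine arrived torn.\n"
    'JSON: {"category": "delivery", "sentiment": "negative"}\n\n'
    "Review: Great articles every month!\n"
    'JSON: {"category": "content", "sentiment": "positive"}\n\n'
)


def build_prompt(review, last_error=None):
    prompt = FEW_SHOT
    if last_error:
        # Feed the validation error back so the model can correct itself.
        prompt += f"Your previous answer was invalid: {last_error}\n\n"
    return prompt + f"Review: {review}\nJSON: {PRIMER}"


def parse_and_validate(completion):
    # json.JSONDecodeError is a subclass of ValueError.
    data = json.loads(PRIMER + completion)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


def extract(review, complete, max_retries=3):
    last_error = None
    for attempt in range(1, max_retries + 1):
        completion = complete(build_prompt(review, last_error))
        try:
            return parse_and_validate(completion), attempt
        except ValueError as err:
            last_error = str(err)  # every retry resends the whole prompt
    raise RuntimeError(f"no valid output after {max_retries} attempts: {last_error}")
```

Note that each retry pays for the full few-shot prompt again, plus the error message.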
Depending on the complexity of the desired schema, you may have to retry the same request multiple times. The larger input, caused by the multiple examples and potentially multiple retries per request, will substantially increase both your latency and your cost. Not nice. Also, and I can’t stress this enough, the more complex the schema, the more times you’ll have to retry because the model will be more likely to mess up.
Empirically, a larger LLM will follow the constraints and instructions better, so fewer examples and retries will be required on average. Still, there will be a nonzero probability of failure. Plus, if your overall task is so simple that a 7B parameter model can do it well enough, why should you have to query a more expensive model just to get correct JSON more often? Now think of what a fine-tuned 7B model could do: compete with GPT-4 on specific tasks and be easy to integrate into your system, for example. Do you see just how foolish it would be to have to rely on big generalist models just to make it easier to integrate LLMs into your software?
So, the method described above is fragile and expensive. Now for the good way—use structured outputs. Structured output support is essentially a custom sampling technique that follows the provided schema to the letter, literally. It works by weaving the schema boilerplate into the generation loop, resulting in a fast and correct way to generate a valid output. No retries, no clever prompting, no fragile parsing.
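Here is a deliberately tiny sketch of the idea, not a description of how OctoAI implements it. The decoder writes the schema boilerplate itself and only asks the model to choose among values the schema allows. The schema, `make_toy_scorer`, and the scoring interface are all hypothetical; real implementations typically apply the same principle at the token level, masking out any token that would violate the schema.

```python
import json

# Hypothetical schema: each field may only take one of the listed values.
SCHEMA = {
    "category": ["content", "delivery", "subscription"],
    "sentiment": ["positive", "negative"],
}


def constrained_generate(score, schema):
    """Emit a JSON object: boilerplate comes from the schema, values from the model."""
    parts = []
    for key, allowed in schema.items():
        so_far = "{" + ", ".join(parts)  # text generated up to this point
        # The model's preference counts, but only among valid candidates.
        best = max(allowed, key=lambda value: score(so_far, key, value))
        parts.append(f"{json.dumps(key)}: {json.dumps(best)}")
    return "{" + ", ".join(parts) + "}"


def make_toy_scorer(preferences):
    # Stand-in for model logits: a higher number means "the model prefers this".
    return lambda so_far, key, value: preferences.get((key, value), 0.0)


scorer = make_toy_scorer({
    ("category", "delivery"): 2.0,
    ("sentiment", "negative"): 1.5,
    # The "model" would love to add a rationale, but it is never a candidate.
    ("sentiment", "Sure, here is my rationale"): 9.9,
})
print(constrained_generate(scorer, SCHEMA))
```

Whatever the scorer prefers, the result parses as JSON and matches the schema, which is why no retry loop is needed.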
Now, let’s move on to the demo, where I will back my arguments with numbers.
Demo time
First, just look at the difference in the lines of code of each approach… and latency… and cost.
Method | Latency (.5/.95 percentile) | Num of tokens sent (.5/.95 percentile) | Cost per run (.5/.95 percentile) |
---|---|---|---|
OctoAI JSON Mode | 1.05s/2.55s | 390/590 | 0.000069/0.0001 |
DIY Structured Outputs | 1.12s/6.4s | 465/1850 | 0.000082/0.00036 |
All these results were obtained by tackling one of the most mundane but necessary tasks for everyone involved in Product Management and Customer Experience — writing Jira tickets (or insert your issue tracker here).
The latency, cost, and number of input tokens are all per request. This demo processed 120 reviews for ten products and wrote 15 Jira tickets in about 5 minutes using Llama 3 8B Instruct. Notice that the DIY method sometimes uses substantially more tokens. This happens because the request needs to be retried if it has validation or parsing errors. In fact, the DIY method failed to process about 13% of these reviews before retrying three times, making the whole process over 40% slower.
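To see where the gap actually opens up, you can divide the DIY numbers by the JSON mode numbers. The snippet below only does arithmetic on the values from the table above; the dictionary layout and the `ratio` helper are mine.

```python
# (median, 95th percentile) values copied from the table above.
RESULTS = {
    "OctoAI JSON Mode": {
        "latency_s": (1.05, 2.55),
        "tokens": (390, 590),
        "cost": (0.000069, 0.0001),
    },
    "DIY Structured Outputs": {
        "latency_s": (1.12, 6.4),
        "tokens": (465, 1850),
        "cost": (0.000082, 0.00036),
    },
}


def ratio(metric, percentile_index):
    """How many times larger the DIY value is than the JSON mode value."""
    diy = RESULTS["DIY Structured Outputs"][metric][percentile_index]
    json_mode = RESULTS["OctoAI JSON Mode"][metric][percentile_index]
    return diy / json_mode


for metric in ("latency_s", "tokens", "cost"):
    print(f"{metric}: median x{ratio(metric, 0):.2f}, p95 x{ratio(metric, 1):.2f}")
```

At the median the two approaches are close (roughly 1.1x to 1.2x). The difference is in the tail: at the 95th percentile the DIY method is about 2.5x slower, sends about 3.1x more tokens, and costs about 3.6x more, which is consistent with retries being the culprit.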
Anyway, I bet a product manager will need at least 20x the time to go over all these reviews and formulate the tickets.
So, here’s how the demo works, high level:
We convert the reviews into JSON, additionally labeling them with the specific category and whether the review is positive or negative
We then filter only the negative reviews
We then group together the issues that refer to the same problem
Finally, we summarize these issue groups into a single ticket and “submit” it to Jira
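The skeleton of those four steps can be sketched as follows. This is not the demo code: it assumes step one has already happened, so each `Review` is the structured output of an LLM call, and it swaps the LLM-based grouping and summarization for a plain dictionary key and a string template. All names here are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Review:
    # Assumed shape of one review after the LLM has converted it to JSON.
    product: str
    category: str
    sentiment: str  # "positive" or "negative"
    issue: str      # short description of the problem


def negative_only(reviews):
    return [r for r in reviews if r.sentiment == "negative"]


def group_by_issue(reviews):
    # Stand-in for LLM-based grouping: same product and category = same problem.
    groups = defaultdict(list)
    for review in reviews:
        groups[(review.product, review.category)].append(review)
    return dict(groups)


def to_ticket(key, reviews):
    # Stand-in for LLM summarization: a simple template.
    product, category = key
    return {
        "summary": f"[{product}] {category}: {len(reviews)} negative review(s)",
        "description": "\n".join(f"- {r.issue}" for r in reviews),
    }


def build_tickets(reviews):
    groups = group_by_issue(negative_only(reviews))
    return [to_ticket(key, members) for key, members in groups.items()]
```

Because every stage consumes and produces structured data, each LLM call can be replaced, tested, or mocked independently of the others.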
We simplified a bit—it’s a demo, after all. We used magazine reviews from Amazon as our source of customer feedback. We also mocked our API requests to Jira. Jira Cloud and Data Center, as well as Jira competitors, have REST APIs that you can use to integrate structured output into your automations and workflows.
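For reference, here is roughly what the “submit” step could look like with a mock in place of Jira. The payload follows the field layout of Jira's REST API v2 issue-creation endpoint (`POST /rest/api/2/issue`); the project key `SHOP`, the `Bug` issue type, and the `MockJira` class are assumptions made for this sketch. If you target Jira Cloud's v3 API, the description is expected in Atlassian Document Format rather than as a plain string, so check the documentation for your deployment.

```python
import json


def jira_issue_payload(summary, description, project_key="SHOP", issue_type="Bug"):
    # "SHOP" and "Bug" are placeholder values; use your own project and issue type.
    return {
        "fields": {
            "project": {"key": project_key},
            "summary": summary,
            "description": description,
            "issuetype": {"name": issue_type},
        }
    }


class MockJira:
    """Records payloads instead of sending HTTP requests."""

    def __init__(self):
        self.created = []

    def create_issue(self, payload):
        # Serializing here catches non-JSON-safe payloads early.
        self.created.append(json.dumps(payload))
        return f"MOCK-{len(self.created)}"


jira = MockJira()
key = jira.create_issue(
    jira_issue_payload("Magazines arrive damaged", "- arrived torn\n- cover missing")
)
```

Swapping the mock for a real HTTP client is then a matter of posting the same payload with your credentials.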
Using the retrieval augmented generation (RAG) paradigm, you can further contextualize each of these tickets with information and links from Confluence, other Jira tickets, internal docs, and even chats. Really, the sky, and your internal tooling budget, is the limit. With all this added context, Jira could bring many more tangible benefits. Your team will be able to go from ticket to solution much faster. All this curated and contextualized information can enable your Product and Customer Experience teams to run feedback analytics and reduce resolution time. High-quality and easy-to-run feedback analytics can help your organization make more data-driven decisions and focus on high-impact projects with high confidence. I know this sounds like a very aggressive sales pitch for some CX/Product analytics startup, but we’re not selling you that. This is the opportunity to unlock real business value with generative AI in an internal-facing implementation, allowing you to explore the power of LLMs and some of the new architectural patterns in this brave new world.
Try structured outputs on OctoAI today
You can get started with the OctoAI Text Gen Solution today at no cost, and explore JSON mode (structured output) with models including Llama 3, Mixtral 8x22b, and even WizardLM 2 8x22b. The application used for the demo in this blog is available in the OctoAI Text Gen cookbook repository on our GitHub.
Also, check out OctoStack to bring the power of OctoAI to your private cloud.