
Effective text summarization with Mixtral on OctoAI

Blog Author - Alex Burlacu

Mar 4, 2024

8 minutes

We’ve all had to deal with tons of text that we need, but oh soooo don’t want, to read. Whether it’s reading a novel to write a good essay, reviewing a lengthy contract to make sure we’re not missing the fine print, or checking the quarterly updates to stay on top of things, we all wish to get through them with the least reading possible. In case you didn’t get it, I’m reminding you how tedious the world was before LLMs and their exquisite capability to summarize text and identify key points.

People try to use LLMs for various purposes, sometimes with good results, like chat and document processing; other times not, like using LLMs to solve reasoning problems, make discoveries, etc. Summarization is one of the best use cases for LLMs and one of the most common applications in the industry.

It’s a “solved” problem, or is it?

This post will cover how to use LLMs for long text summarization.

Why is text summarization a problem?

Remember how I sounded kinda skeptical about calling text summarization a solved problem? Serving LLMs is only half of it. There’s more!

First, there’s serving the model. Yes, LLMs are capable, but they’re also hard to serve. Try serving one of the larger Llama 2 models, or even a 7 billion parameter Mistral, to more than a few dozen people, and you’ll see just how expensive this is. “But Alex, I need it only for myself, and I heard about this llama.cpp thing”, I hear some say. Well, sure, if you need it only for yourself, go for it, but you’ll need quite the PC to comfortably use these models, costing at least a few thousand dollars.

So unless you have to, you’ll use something like an API service and pay-per-use of API calls. With APIs, people usually look for the most cost-efficient solution.

Once you have a ready-to-use endpoint, there’s the input sequence limit. Any modern transformer model has a limited input sequence length; this is a reality of the architecture. And the longer this maximum input sequence is, the more complicated and expensive the model is to serve. So it’s not just a limitation: there is a real cost reason you may prefer a model with a smaller input token limit.

If you want to know why, read along; otherwise, you can skip to the next section, where I explain various techniques to overcome this limitation.

So, why exactly do transformers have a limited input sequence length? The attention mechanism’s memory consumption grows quadratically with the input size, so as we increase the input, the LLM consumes non-linearly more memory. The reason lies in the attention mechanism itself, which is mostly just matrix multiplications. The matrices involved have two important dimensions, the embedding dimension and the sequence dimension, i.e., each matrix represents a stack of token embeddings standing for the input sequence; see the diagram.

Notice in the diagram that the result of multiplying the Query (Q) matrix with the transposed Key (K) matrix is another matrix whose size is the number of input tokens by the number of input tokens. This means that if we were to, for example, double the input size, we’d need 4x more memory to hold this matrix.
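A quick back-of-the-envelope check makes the quadratic growth concrete. The sketch below only counts the bytes of the Q·Kᵀ score matrix; the head count and fp16 element size are illustrative assumptions, not the exact numbers for any particular model:

```python
def attention_score_bytes(seq_len: int, num_heads: int = 32, bytes_per_elem: int = 2) -> int:
    """Bytes needed per layer to hold the (seq_len x seq_len) attention score
    matrix for every head, assuming fp16 (2-byte) scores."""
    return num_heads * seq_len * seq_len * bytes_per_elem

# Doubling the sequence length quadruples the score-matrix memory:
base = attention_score_bytes(4096)     # 1 GiB with these assumptions
doubled = attention_score_bytes(8192)
print(doubled // base)  # 4
```

The embedding dimension never appears in this count: the score matrix alone is what scales with the square of the input length.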

There’s ongoing research on improving this limitation, from more hardware-efficient attention implementations (FlashAttention) to entirely different attention mechanisms (Attention sinks and Windowed attention). These solutions are only partial, and some can even worsen the performance of transformer models over long inputs. Don’t get me wrong, this is an essential short-term research direction because we already have a ton of hardware tuned specifically for transformers. We need to make the most out of it.

But the most exciting direction is research into architectures that don’t use attention at all and are subquadratic in input size, like selective state-space models, with their most recent incarnation being Mamba.

Implementing arbitrarily long input summarization with limited-input models

Until we have models that can take arbitrarily long inputs, we must find a solution for the existing ones. So, what do we do, specifically for summarization tasks? Well, we split the text into manageable chunks and summarize those. The important part is, how do we combine the resulting summaries? Here, we have two main methods: MapReduce-like processing and so-called Contextual Compression or Refined Summarization.


MapReduce summarization

MapReduce will summarize each chunk in parallel and then summarize the resulting summaries.

Although fast and capable of handling huge documents, MapReduce is prone to oversimplification and is very sensitive to the chunking scheme, because if the document is split without regard to its structure, the information at chunk borders may be lost during summarization.
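The whole method fits in a few lines. In this sketch, `summarize` is a stand-in for any callable that takes text and returns a summary, such as a request to a hosted LLM endpoint:

```python
from concurrent.futures import ThreadPoolExecutor

def mapreduce_summarize(chunks: list[str], summarize) -> str:
    """Map: summarize every chunk in parallel.
    Reduce: summarize the concatenation of the chunk summaries.

    `summarize` is any str -> str callable, e.g. an LLM API call.
    """
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(summarize, chunks))
    return summarize("\n".join(partials))
```

For very long documents, the reduce step can itself exceed the context window, in which case it is applied recursively over the partial summaries.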

Refined summarization

Refined Summarization tries to improve summarization quality over the MapReduce method, trading speed for it. We summarize the first chunk; then, for each subsequent chunk, we concatenate the new chunk to the summary-so-far and ask for a new summary, repeating the process until the document is exhausted.

Refined Summarization is not without issues; I’m not talking only about speed. This method is prone to “recency bias”, meaning that it may decide that later information is more important and, in turn, drop important summary points from the early parts of the document.
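Here is the refine loop as a sketch. As before, `summarize` stands in for an LLM call, and the prompt wording is just an illustration:

```python
def refine_summarize(chunks: list[str], summarize) -> str:
    """Summarize the first chunk, then fold each subsequent chunk into
    the running summary, one sequential LLM call per chunk.

    `summarize` is any str -> str callable, e.g. an LLM API call.
    """
    summary = summarize(chunks[0])
    for chunk in chunks[1:]:
        summary = summarize(
            f"Existing summary:\n{summary}\n\n"
            f"New text:\n{chunk}\n\n"
            "Refine the summary to incorporate the new text."
        )
    return summary
```

Note that the calls are inherently sequential, which is exactly where the speed penalty comes from: you cannot start on chunk N before chunk N-1 is summarized.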

Bonus method: Summary reranking

As you can see, both methods may result in subpar summaries. An advanced trick to mitigate this is the so-called Summary Reranking. Using a special component called reranker (which can be the same LLM with a different prompt), the system will pick the most relevant points in each summary chunk and only keep those. In a way, this will explicitly select the most important main points for a final summary that’s both nuanced and to the point. This method can be used in combination with MapReduce summarization.
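A minimal sketch of the reranking idea, with a `score` callable standing in for the reranker (which, as noted, could be the same LLM with a relevance-scoring prompt):

```python
def rerank_and_merge(partial_summaries: list[str], score, top_k: int = 10) -> str:
    """Split each partial summary into points (one per line here, for
    simplicity), score every point with the reranker, keep the top_k.

    `score` is any str -> number callable; higher means more relevant.
    """
    points = [p for s in partial_summaries for p in s.split("\n") if p.strip()]
    ranked = sorted(points, key=score, reverse=True)
    return "\n".join(ranked[:top_k])
```

In a MapReduce pipeline, this would slot in between the map and reduce steps: the reduce call then sees only the highest-scoring points instead of every partial summary verbatim.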

Implementation details

Not all tokens are created equal. Depending on the vocabulary size of a model, one token may be an entire word, a suffix/prefix, or even just a couple of characters. Models with larger vocabularies use fewer tokens, especially for domain-specific text. For example, current Llamas and Mistrals have vocabularies of 32k unique tokens. But the older GPT2, for example, has over 50k unique tokens, and the most modern OpenAI models have a bit over 100k.

If you want to see how vocabulary size and context window size affect the cost of summarization, check this previous blog, where we directly compare Mixtral running on OctoAI and OpenAI’s GPT-3.5-Turbo.

Anyway, when it comes to cost estimation, the following formula will help you understand how much exactly you are paying for your API requests on most existing LLM providers:

TotalCost = (NumInputTokens * CostPerInputToken) + (NumOutputTokens * CostPerOutputToken)
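In code, with per-million-token rates (the rates below are made up for illustration; check your provider’s pricing page):

```python
def total_cost(num_input_tokens: int, num_output_tokens: int,
               cost_per_input_token: float, cost_per_output_token: float) -> float:
    """Total price of a request under the usual pay-per-token scheme."""
    return (num_input_tokens * cost_per_input_token
            + num_output_tokens * cost_per_output_token)

# Example: 600k input tokens, 5k output tokens,
# at hypothetical rates of $0.30 / $0.50 per million tokens:
print(total_cost(600_000, 5_000, 0.30 / 1e6, 0.50 / 1e6))  # ~0.1825 ($)
```

Notice that for summarization the input side dominates: you feed the model far more tokens than it produces, so the input rate matters most.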

Just like in the previous post, there’s code for this one, too. Yay! We’re still interested in long document summarization, so our demo code uses a few PDFs of various sizes: two guidelines from the SEC that fit into the context window of a Mixtral model, plus a Tesla quarterly report and an Apple yearly report, also from the SEC. These latter documents contain over 25 and 40 thousand words, respectively, big enough to require the techniques described above to be fully processed by the model. And finally, there’s the last document: War and Peace, by Leo Tolstoy. This book has almost 600 thousand words. It (or some other very long book you read and remember) should be your benchmark for long text summarization.

Alright, it’s showtime! Now that we know of various methods for summarizing long text and even have some PDFs and the code, let’s feed it all that text and see what happens. Let’s do, for example, War and Peace, using the MapReduce summarization technique with Mixtral on OctoAI. Mixtral supports exactly 32768 tokens of context, so we’ll split the text into 32768-token chunks and feed them to the model; this should be a breeze.

Not so fast! We’ll actually need to split the documents into chunks of approximately 32185 tokens. “Why this oddly specific number?” I’m sure you’re asking yourself right now. Mixtral, Llama, and most other new transformer models for text generation are auto-regressive, so we must fit both the input and the output sequence within the model’s context window. Otherwise, they will skip the first few tokens, which are usually part of the system prompt. So, given that Mixtral supports up to 32768 tokens of context, and assuming our summary should be up to 512 tokens, our input should contain no more than 32768 - 512 = 32256 tokens. But that’s still a different number! What about the other 70 or so tokens? Don’t forget that we need to supply a system prompt, an instruction telling the model to summarize the given text. This instruction is part of the input, so we must account for it too. Our default instruction is about 65 tokens, depending on the tokenizer. Finally, many models, including Mixtral, use special delimiter tokens to mark what the instruction is, what the input is, and where the output should begin. Given all that, now you see why we can’t use all 32k+ tokens as input.
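The budget arithmetic above is worth encoding once and reusing. The instruction and delimiter token counts below are rough, tokenizer-dependent estimates, not universal constants:

```python
def max_chunk_tokens(context_window: int = 32768, max_output: int = 512,
                     instruction_tokens: int = 65, delimiter_tokens: int = 6) -> int:
    """Largest chunk that still leaves room in the context window for the
    instruction, the special delimiter tokens, and the generated summary."""
    return context_window - max_output - instruction_tokens - delimiter_tokens

print(max_chunk_tokens())  # 32185
```

Swap in your own model’s context window and prompt length, and pad the estimates conservatively: overshooting by even one token triggers the silent truncation described above.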

Armed with all this information, you probably noticed something. If we have more chunks, we need to send over the same system prompts more often, and we will get 512-token summaries more often, too. All this will increase our costs. That’s right, longer context windows can help you slightly reduce the cost of summarization.

Ok, now you’re ready to rock. For those of you who skipped the section, here’s a TL;DR:

  • Not all tokens are created equal: some models need fewer tokens to encode the same text, typically those with larger vocabularies.

  • For larger-than-context-window text, be mindful that you need to fit a few things in the context window besides the input text itself: (1) the system prompt, (2) some special delimiter tokens, and (3) the output text. Otherwise, the underlying framework may silently truncate your text, and you will notice weird-looking outputs.

  • Longer context windows require less chunking, so there are fewer requests and a slightly smaller total number of input tokens. This is beneficial for the final cost.

Try it out on OctoAI today

Open benchmarks and reproducibility, in general, are paramount for progress. That’s why all the code used for this article is available on GitHub. Now, you too can run those benchmarks and examples and see how different models and summarization techniques can handle long texts. Don’t forget to get an OctoAI account first!