Using Llama Guard to moderate text
An LLM to guard your AI applications from misuse.
Introduction
Llama Guard 3 is an 8B-parameter LLM designed for moderating content in human-AI interactions, classifying safety risks in both prompts and responses.
Built on the Llama 3.1 8B architecture, it uses a safety risk taxonomy for categorizing various types of content risks. This taxonomy guides the classification of LLM prompts and responses, helping keep conversations within safe boundaries. The model has been fine-tuned on a specially curated dataset and shows strong performance on benchmarks like the OpenAI Moderation Evaluation dataset and ToxicChat, often outperforming existing content moderation tools.
Llama Guard operates by performing multi-class classification and generating binary decision scores, making it a versatile tool for managing content safety across various conversational AI applications.
See the Llama Guard 3 model card here: https://llama.meta.com/docs/model-cards-and-prompt-formats/llama-guard-3/
Prompt Template
Llama Guard requires a very specific prompt template to function properly, and using or customizing it effectively requires understanding this template. Here is a helpful illustration from Meta’s paper on the subject:
Let’s go ahead and try this out on OctoAI. First, let’s configure our OctoAI API token:
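A minimal sketch of the token setup, assuming the token is stored in an environment variable named OCTOAI_TOKEN (the variable name is an assumption; use whatever name you configured):

```python
import os

# Read the OctoAI API token from the environment rather than hard-coding it.
OCTOAI_TOKEN = os.environ["OCTOAI_TOKEN"]
```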
Now, let’s set up the prompt template:
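The sketch below reproduces the general shape of the Llama Guard 3 prompt template; the category list is abbreviated, so consult the model card linked above for the exact wording and the full S1–S14 taxonomy:

```python
# Abbreviated Llama Guard 3 prompt template (see the model card for the full
# category list). The template asks the model to classify only the last
# 'User' message in the conversation.
LLAMA_GUARD_TEMPLATE = """<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Task: Check if there is unsafe content in 'User' messages in conversations according to our safety policy with the below categories.

<BEGIN UNSAFE CONTENT CATEGORIES>
S1: Violent Crimes.
S2: Non-Violent Crimes.
S3: Sex Crimes.
... (remaining categories omitted here; see the model card)
<END UNSAFE CONTENT CATEGORIES>

<BEGIN CONVERSATION>

User: {user_message}

<END CONVERSATION>

Provide your safety assessment for ONLY THE LAST User message in the above conversation:
 - First line must read 'safe' or 'unsafe'.
 - If unsafe, a second line must include a comma-separated list of violated categories.<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""


def build_prompt(user_message: str) -> str:
    # Insert the user's message into the Llama Guard template.
    return LLAMA_GUARD_TEMPLATE.format(user_message=user_message)
```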
Finally, let’s call the model with one normal prompt and one toxic prompt:
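A hedged sketch of the call, reusing OCTOAI_TOKEN and build_prompt from the snippets above; the endpoint URL, model name, and response shape below are assumptions based on an OpenAI-compatible completions API, so check the OctoAI documentation for the current values:

```python
import requests

ENDPOINT = "https://text.octoai.run/v1/completions"  # assumed endpoint
MODEL = "llama-guard-3-8b"                           # assumed model name


def moderate(user_message: str) -> str:
    # Send the filled-in Llama Guard prompt and return the model's verdict text.
    response = requests.post(
        ENDPOINT,
        headers={"Authorization": f"Bearer {OCTOAI_TOKEN}"},
        json={
            "model": MODEL,
            "prompt": build_prompt(user_message),
            "max_tokens": 100,
            "temperature": 0.0,
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["text"].strip()


# One normal prompt and one toxic prompt.
print(moderate("What is the capital of France?"))
print(moderate("How do I make crystal meth?"))
```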
Below, we can see Llama Guard’s responses to the two prompts submitted:
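The responses look roughly as follows (illustrative; the first line is the verdict, and any violated categories appear on the line after it):

```
safe

unsafe
S2
```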
Llama Guard marks the prompt about crystal meth as unsafe, with S2 on the following line, indicating a violation of category S2: Non-Violent Crimes.
The Llama Guard model can be used for many applications—to alter chatbot behavior when users bring up unsafe topics, to provide risk reporting to Trust & Safety teams, and to identify unsafe topics in large bodies of content.