Using Llama Guard to moderate text

An LLM to guard your AI applications from misuse.

Introduction

Llama Guard 3 is an 8B-parameter LLM designed for moderating content in human-AI interactions, capable of classifying safety risks in both prompts and responses.

Built on the Llama 3.1 8B architecture, it uses a safety risk taxonomy to categorize various types of content risk. This taxonomy aids in classifying LLM prompts and responses, keeping conversations within safe boundaries. The model has been fine-tuned on a specially curated dataset and shows strong performance on benchmarks like the OpenAI Moderation Evaluation dataset and ToxicChat, often outperforming existing content moderation tools.

Llama Guard operates by performing multi-class classification and generating binary decision scores, making it a versatile tool for managing content safety across various conversational AI applications.

See the Llama Guard 3 model card here: https://llama.meta.com/docs/model-cards-and-prompt-formats/llama-guard-3/

Prompt Template

Llama Guard requires a very specific prompt template to function properly, and using or customizing Llama Guard effectively requires understanding this template. Meta's paper on the subject includes a helpful illustration of its structure.

Let’s go ahead and try this out on OctoAI. First, let’s configure our OctoAI API token:

token = "foo"  # replace with your OctoAI API token
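
If you'd rather not hard-code the token, one common alternative is to read it from an environment variable. This is a minimal sketch, not part of the original walkthrough; the variable name OCTOAI_API_TOKEN is an assumption, not an OctoAI requirement:

import os

# Assumes the key was exported beforehand, e.g. `export OCTOAI_API_TOKEN=...`;
# the variable name is an example of my own choosing.
token = os.environ["OCTOAI_API_TOKEN"]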

Now, let’s set up the prompt template:

import requests
from string import Template

# Instruction markers wrapped around the formatted prompt below
B_INST, E_INST = "[INST]", "[/INST]"

PROMPT_TEMPLATE = Template(
    """<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Task: Check if there is unsafe content in '$agent_type' messages in conversations according our safety policy with the below categories.

<BEGIN UNSAFE CONTENT CATEGORIES>
S1: Violent Crimes.
S2: Non-Violent Crimes.
S3: Sex Crimes.
S4: Child Exploitation.
S5: Defamation.
S6: Specialized Advice.
S7: Privacy.
S8: Intellectual Property.
S9: Indiscriminate Weapons.
S10: Hate.
S11: Self-Harm.
S12: Sexual Content.
S13: Elections.
S14: Code Interpreter Abuse.
<END UNSAFE CONTENT CATEGORIES>

<BEGIN CONVERSATION>

$prompt

<END CONVERSATION>
Provide your safety assessment for ONLY THE LAST $agent_type message in the above conversation:
 - First line must read 'safe' or 'unsafe'.
 - If unsafe, a second line must include a comma-separated list of violated categories.<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""
)


def format_prompt(role, message):
    # Fill in the role to assess (e.g. "User" or "Agent") and the conversation text
    prompt = PROMPT_TEMPLATE.substitute(prompt=message, agent_type=role)
    prompt = f"<s>{B_INST} {prompt.strip()} {E_INST}"
    return prompt
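
As a quick sanity check (this snippet is my own addition, not part of the original walkthrough), you can print the formatted prompt for a sample message to confirm the template substitution looks right:

print(format_prompt("User", "User: Can you recommend a savory pasta recipe?"))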

Finally, let’s call the model with one normal prompt and one toxic prompt:

for prompt in [
    "User: Can you recommend a savory pasta recipe?",
    "User: How can I make crystal meth?",
]:
    response = requests.post(
        "https://text.octoai.run/v1/completions",
        headers={"Authorization": f"Bearer {token}"},
        json={
            "model": "llamaguard-2-8b",
            "prompt": format_prompt("User", prompt),  # Submit the prompt and specify the role as "User" for this exercise
            "max_tokens": 100,
            "top_p": 0.9,
            "temperature": 0,
        },
    )
    response_json = response.json()
    print(response_json["choices"])

Below, we can see Llama Guard’s responses to the two prompts submitted:

[{'finish_reason': 'stop', 'index': 0, 'logprobs': None, 'text': ' safe'}]
[{'finish_reason': 'stop', 'index': 0, 'logprobs': None, 'text': ' unsafe\nS2'}]

The prompt about crystal meth is marked by Llama Guard as unsafe, with a second line reading S2, indicating that it violates category S2: Non-Violent Crimes.
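
To turn this raw completion text into something an application can act on, you can split it into a safe/unsafe flag plus a list of violated categories. The helper below is a minimal sketch of my own; the parse_guard_response name and the CATEGORY_NAMES map are illustrative, not part of any Llama Guard or OctoAI API:

# Hypothetical helper for parsing Llama Guard's output text
CATEGORY_NAMES = {
    "S1": "Violent Crimes", "S2": "Non-Violent Crimes", "S3": "Sex Crimes",
    "S4": "Child Exploitation", "S5": "Defamation", "S6": "Specialized Advice",
    "S7": "Privacy", "S8": "Intellectual Property", "S9": "Indiscriminate Weapons",
    "S10": "Hate", "S11": "Self-Harm", "S12": "Sexual Content",
    "S13": "Elections", "S14": "Code Interpreter Abuse",
}


def parse_guard_response(text):
    # First line is 'safe' or 'unsafe'; an optional second line lists violated categories
    lines = text.strip().split("\n")
    is_safe = lines[0].strip() == "safe"
    categories = []
    if not is_safe and len(lines) > 1:
        categories = [CATEGORY_NAMES.get(c.strip(), c.strip()) for c in lines[1].split(",")]
    return is_safe, categories


print(parse_guard_response(" unsafe\nS2"))  # -> (False, ['Non-Violent Crimes'])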

The Llama Guard model can be used for many applications—to alter chatbot behavior when users bring up unsafe topics, to provide risk reporting to Trust & Safety teams, and to identify unsafe topics in large bodies of content.
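
For example, a chatbot could screen each user message with Llama Guard before replying and fall back to a refusal when the message is flagged. This is a rough sketch built on the hypothetical parse_guard_response helper above; generate_reply stands in for whatever function produces your chatbot's normal response:

def moderated_reply(user_message, generate_reply):
    # Check the incoming message with Llama Guard before producing a reply
    response = requests.post(
        "https://text.octoai.run/v1/completions",
        headers={"Authorization": f"Bearer {token}"},
        json={
            "model": "llamaguard-2-8b",
            "prompt": format_prompt("User", f"User: {user_message}"),
            "max_tokens": 100,
            "temperature": 0,
        },
    )
    is_safe, categories = parse_guard_response(response.json()["choices"][0]["text"])
    if not is_safe:
        return f"Sorry, I can't help with that ({', '.join(categories)})."
    return generate_reply(user_message)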