Last week Mistral AI shocked everyone by releasing a larger variant of their groundbreaking Mixtral architecture: the new Mixtral-8x22B model. Its predecessor, Mixtral-8x7B, has been the most popular open LLM, so we were very excited by this new contribution and hosted it quickly. And just today, Mistral AI released their official Mixtral-8x22B-Instruct variant, which we’re launching in the coming hours.
But the fine-tunes keep coming too! On Monday, Microsoft announced WizardLM-2, a fine-tuned Mixtral-8x22B, with much fanfare.
OctoAI moved fast to host WizardLM-2. You can try it out in our playground here. (It’s listed as an “experimental” model, hosted under the name “mixtral-8x22b-finetuned,” because we’ll be evolving this endpoint rapidly to offer the best fine-tuned variants of Mixtral-8x22B).
However, almost as fast as OctoAI could get it launched, Microsoft withdrew the model from GitHub and Hugging Face because they had missed a step in their release process: toxicity testing.
Predictably, this led to a surge in interest, with Redditors clamoring for the model they presumed was toxic, and many folks mirroring the model on Hugging Face. Oh, Reddit. 🤦
This also kicked off a conversation on Hacker News about toxicity testing — what it is, how it's done, and what it's for.
OctoAI partners with Vijil for toxicity testing
At OctoAI, we believe that testing and mitigating toxicity, which is inherent in LLMs trained on Internet data, is important, especially when these models are deployed in business-critical applications. To help our customers deploy models with confidence, we advise rigorous independent testing before models reach production. With the conversation around toxicity testing heating up, we partnered with Vijil to run toxicity tests on WizardLM-8x22B and compare it with both the base Mixtral-8x22B and Mixtral 8x7B Instruct.
Vijil provides cloud services to evaluate, improve, and maintain trust in AI agents. The Vijil eval service scores LLMs on the reliability of performance on domain-specific tasks, vulnerability to attacks, and propensity for harms under benign and hostile conditions. To score propensity for harms, it measures the likelihood that a model will produce toxic outputs, malicious stereotypes, unfair outcomes, and unethical responses by subjecting it to automated red-teaming tests.
Vijil’s red-team tests for toxicity give WizardLM-8x22B a score of 98.33, compared to 89.46 for the base Mixtral-8x22B and 92.93 for Mixtral 8x7B Instruct (higher is better).
Vijil Trust Score: Toxicity (pass rate in %; higher is better)

| Test (number of prompts) | Mixtral 8x7B Instruct | Base Mixtral 8x22B | WizardLM 8x22B |
|---|---|---|---|
| Continuation (2,850) | 100 | 100 | 100 |
| Adversarial Attacks (50) | 100 | 100 | 96.0 |
| Flirtation (1,000) | 99.3 | 93.0 | 99.4 |
| Identity Attacks (1,000) | 99.0 | 86.8 | 100 |
| Insult (1,000) | 95.4 | 75.6 | 96.5 |
| Profanity (1,000) | 96.4 | 79.6 | 96.4 |
| Severe Toxicity (1,000) | 96.8 | 79.8 | 97.4 |
| Sexually Explicit (1,000) | 99.5 | 93.7 | 99.3 |
| Threat (1,000) | 100 | 96.6 | 100 |
| Aggregate (9,900) | 92.93 | 89.46 | 98.33 |
Vijil tests a model for several categories of toxicity. The Vijil evaluation framework draws its test prompts from several sources: prompts collected from the RealToxicityPrompts (RTP) benchmark dataset, prompts generated by a custom adversarial model, and prompts hand-crafted by red-team experts. Each test uses the most provocative prompts (according to the prompt authors) in its category and evaluates the model’s responses with a toxicity detection model. This standards-based approach means that anyone can use the Vijil eval service to measure model risks systematically.
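To make the mechanics concrete, here is a minimal sketch of how a red-team pass rate of this kind can be computed. This is not Vijil’s implementation: the `generate` call to the model under test is a hypothetical placeholder, and the open-source Detoxify classifier and the 0.5 cutoff stand in for whatever toxicity detector and per-category thresholds a real evaluation uses.

```python
# Illustrative sketch only (not Vijil's code): score a batch of provocative
# prompts by generating responses and checking them with a toxicity classifier.
from detoxify import Detoxify

TOXICITY_THRESHOLD = 0.5          # assumed cutoff; real tests tune this per category
detector = Detoxify("original")   # open-source toxicity classification model

def generate(prompt: str) -> str:
    """Placeholder for a call to the model under test (e.g., WizardLM 8x22B)."""
    raise NotImplementedError

def pass_rate(prompts: list[str]) -> float:
    """Percentage of prompts whose responses stay below the toxicity threshold."""
    passed = 0
    for prompt in prompts:
        response = generate(prompt)
        score = detector.predict(response)["toxicity"]
        if score < TOXICITY_THRESHOLD:
            passed += 1
    return 100.0 * passed / len(prompts)

# Example usage with prompts drawn from a benchmark such as RealToxicityPrompts:
# print(f"Insult pass rate: {pass_rate(insult_prompts):.1f}%")
```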
Get started on OctoAI and learn more about toxicity testing
If you want to learn more about how Vijil automates red-teaming at scale, send a note to contact@vijil.ai. Developers can try WizardLM-8x22B on OctoAI today.
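If you’d rather hit the endpoint programmatically than use the playground, the snippet below is a minimal sketch. It assumes OctoAI’s OpenAI-compatible chat completions API at https://text.octoai.run/v1 and the experimental endpoint name “mixtral-8x22b-finetuned” mentioned above; check the OctoAI docs for the current base URL and model identifiers.

```python
# Minimal sketch: query the experimental WizardLM-8x22B endpoint on OctoAI
# via an OpenAI-compatible client. Base URL and model name are assumptions.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://text.octoai.run/v1",   # assumed OctoAI text-gen base URL
    api_key=os.environ["OCTOAI_API_TOKEN"],  # your OctoAI API token
)

response = client.chat.completions.create(
    model="mixtral-8x22b-finetuned",         # experimental endpoint name from above
    messages=[{"role": "user", "content": "In one paragraph, what does toxicity testing check for?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```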