
Pricing & Plans

Get started today on OctoAI and receive $10 of free credit in your account.


Text Gen Solution

The $10 credit is the equivalent of over 500,000 words with the largest Llama 2 70B model, and over a million words with the new Mixtral 8x7B model.

OctoAI provides a unified API endpoint for building on your choice of open source LLMs and their variants.
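
As a sketch of what calling that endpoint looks like (the URL https://text.octoai.run/v1/chat/completions, the model name, and the OCTOAI_TOKEN environment variable are assumptions here; check the documentation for the exact values):

```python
# Minimal sketch: one POST to the unified endpoint, swapping models by name only.
# The endpoint URL, model identifier, and env var name are assumptions; see the docs.
import os
import requests

resp = requests.post(
    "https://text.octoai.run/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OCTOAI_TOKEN']}"},
    json={
        "model": "mixtral-8x7b-instruct",  # or e.g. "llama-2-70b-chat"
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```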

New: Model Remix Credits

We're giving away up to 150x bonus credits for our brand new Text Gen Solution, on top of our industry-leading cost per token. This requires a minimum spend or a commitment to spend.

See detailed pricing

Plans

  • Free Trial: $10 free credit upon sign up. Get started building your project.
  • Pro: $0.13 per 1M tokens* for 7B models; $0.34 per 1M tokens* for Mixtral 8x7B; $0.86 per 1M tokens* for 70B models.
  • Enterprise: Contact Us. Bring your own checkpoint.

*Blended price assuming a 4:1 input-to-output token ratio; see the FAQ below.

Features

  • Mixtral 8x7B
  • Mistral Instruct
  • Llama 2 Chat
  • Code Llama Instruct
  • GTE Large
  • Fine-tuning
  • Bring your choice of checkpoints
  • Committed use discounts
  • Performance optimization options
  • Contractual SLAs
  • Dedicated Customer Success Manager
  • Option for private deployment

Sign Up
Contact us

Frequently asked questions

Don’t see the answer to your question here? Feel free to reach out so we can help.

What are your rate limits for the Text Gen Solution?

The rate limits are as follows:

  • Free Tier: 10 RPM
  • Pro Tier: 240 RPM
  • Enterprise Tier: contact us

Higher rate limits are available; please reach out if you need an increase.
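
If you hit these limits, a common client-side pattern is to back off and retry. A minimal sketch, assuming the API returns HTTP 429 when a tier's RPM limit is exceeded (the exact status-code behavior is an assumption; adjust to what the API actually returns):

```python
# Sketch: retry with exponential backoff when the service signals rate limiting.
# Assumes HTTP 429 is returned when a tier's RPM limit is exceeded.
import time
import requests

def post_with_backoff(url, headers, payload, max_retries=5):
    for attempt in range(max_retries):
        resp = requests.post(url, headers=headers, json=payload, timeout=30)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ... between attempts
    raise RuntimeError(f"still rate limited after {max_retries} attempts")
```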
What are input and output tokens?

Tokens are the units used to measure input and output text for LLMs; 1,000 tokens is roughly 750 words. Input tokens count the tokens in your prompt (including any context information), and output tokens count the tokens the model generates.
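
For back-of-the-envelope cost planning, that rule of thumb converts word counts straight into token counts and dollars. A rough sketch using the Pro prices above (the 0.75 words-per-token factor is the approximation from this answer, not an exact tokenizer count):

```python
# Rough cost estimate from word counts, using ~0.75 words per token.
WORDS_PER_TOKEN = 0.75

def estimate_cost_usd(words_in, words_out, price_per_1m_tokens):
    tokens = (words_in + words_out) / WORDS_PER_TOKEN
    return tokens / 1_000_000 * price_per_1m_tokens

# e.g. a 3,000-word prompt and a 750-word answer on Mixtral 8x7B at $0.34/1M tokens
print(f"${estimate_cost_usd(3000, 750, 0.34):.4f}")  # ~$0.0017
```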

What is the typical input to output token ratio?

The input-to-output token ratio varies, and it increases for use cases that need more context information. A typical Retrieval Augmented Generation (RAG) question-and-answer implementation sees input-to-output ratios starting at about 4:1*. The blended price listed here assumes this 4:1 ratio; detailed input and output token pricing is listed in our documentation.
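
Concretely, a blended price at a 4:1 ratio is just the ratio-weighted average of the separate input and output prices. A short sketch with hypothetical per-direction prices (the real ones are in the documentation):

```python
# Blended per-1M-token price for a given input:output ratio.
def blended_price(input_price, output_price, ratio_in=4, ratio_out=1):
    return (ratio_in * input_price + ratio_out * output_price) / (ratio_in + ratio_out)

# Hypothetical example: $0.30/1M input and $0.50/1M output at the assumed 4:1 ratio
print(blended_price(0.30, 0.50))  # 0.34
```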

How is RAG implemented?

There are multiple ways customers can build a RAG application on OctoAI. OctoAI lets customers run their choice of LLMs (like Llama 2 70B, Mixtral 8x7B, and Mixtral 8x22B) and embedding models (like gte-large). With these primitives, customers can use their preferred vector database as the reference data store for their RAG application. OctoAI also integrates with popular LLM application development frameworks like LangChain, so customers can use LangChain's pre-built functions to simplify RAG development. Lastly, OctoAI integrates with turnkey RAG frameworks like Pinecone Canopy so customers can easily implement RAG with their own data.
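
To make the flow concrete, here is a bare-bones sketch of the retrieve-then-generate loop, using an in-memory list in place of a vector database (the endpoint URLs and model identifiers are assumptions; in practice you would swap the list for your vector database of choice, or build the same loop with LangChain or Pinecone Canopy):

```python
# Minimal RAG loop: embed documents, retrieve by cosine similarity, generate.
import os
import requests

BASE = "https://text.octoai.run/v1"  # assumed endpoint; see the docs
HEADERS = {"Authorization": f"Bearer {os.environ['OCTOAI_TOKEN']}"}

def embed(text):
    r = requests.post(f"{BASE}/embeddings", headers=HEADERS,
                      json={"model": "thenlper/gte-large", "input": text}, timeout=30)
    r.raise_for_status()
    return r.json()["data"][0]["embedding"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5))

docs = ["OctoAI offers a $10 free credit on sign up.",
        "The Pro tier rate limit is 240 requests per minute."]
index = [(d, embed(d)) for d in docs]  # stand-in for a vector database

question = "What rate limit does the Pro tier have?"
q_vec = embed(question)
context = max(index, key=lambda pair: cosine(q_vec, pair[1]))[0]  # top-1 retrieval

r = requests.post(f"{BASE}/chat/completions", headers=HEADERS, json={
    "model": "mixtral-8x7b-instruct",
    "messages": [
        {"role": "system", "content": f"Answer using this context: {context}"},
        {"role": "user", "content": question},
    ],
}, timeout=30)
r.raise_for_status()
print(r.json()["choices"][0]["message"]["content"])
```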

Is it possible to pre-define a prompt?

Yes. All of our Text Gen Solution code samples include a system prompt, for example: "role": "system", "content": "You are a helpful assistant.". Note that Mistral models do not support system prompts out of the box.
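
For example, a request body with a pre-defined system prompt looks like the first payload below; for Mistral models, one common workaround (an assumption here, not an official recommendation) is to fold the instruction into the first user message instead:

```python
# Standard chat format with a pre-defined system prompt.
payload = {
    "model": "llama-2-70b-chat",  # assumed model identifier
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize RAG in one sentence."},
    ],
}

# Workaround for models without native system-prompt support (e.g. Mistral):
# prepend the instruction to the user turn instead.
payload_mistral = {
    "model": "mistral-7b-instruct",  # assumed model identifier
    "messages": [
        {"role": "user",
         "content": "You are a helpful assistant.\n\nSummarize RAG in one sentence."},
    ],
}
```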

Start building in minutes with OctoAI

Our mission at OctoAI is to make AI sustainable and accessible so that developers are liberated to build the next generation of intelligent applications. Sign up and enjoy the freedom to choose your model, infrastructure, and deployment templates.

Sign Up Today
Talk to sales