Text Gen Solution

Text Gen Python SDK

Use the OctoAI Chat Completion API to easily generate text.

The OctoAI class allows you to run inferences simply to any model that accepts JSON-formatted inputs as a dictionary, and provides you with all JSON-formatted outputs as a dictionary. The OctoAI class also supports the Chat Completions API and provides easy access to a set of highly optimized text models on OctoAI.

This guide will walk you through how to select your model of interest, how to call highly optimized text models on OctoAI using the Chat Completions API, and how to use the responses in both streaming and regular modes.


  • Please create an OctoAI API token if you don’t have one already.
  • Please also verify you’ve completed Python SDK Installation & Setup.
    • If you use the OCTOAI_TOKEN envvar for your token, you can instantiate the OctoAI client with octoai = OctoAI() after importing the octoai package.

Text Generation

The following snippet shows you how to use the Chat Completions API to generate text using Llama2.

1import json
3from octoai.client import OctoAI
4from octoai.text_gen import ChatMessage
6client = OctoAI()
7completion = client.text_gen.create_chat_completion(
8 model="meta-llama-3-8b-instruct",
9 messages=[
10 ChatMessage(
11 role="system",
12 content="Below is an instruction that describes a task. Write a response that appropriately completes the request.",
13 ),
14 ChatMessage(role="user", content="Write a blog about Seattle"),
15 ],
16 max_tokens=150,
19print(json.dumps(completion.dict(), indent=2))

The response is of type octoai.text_gen.ChatCompletionResponse. If you print the response from this call as in the example above, it looks similar to the following:

"id": "cmpl-8ea213aece0747aca6d0608b02b57196",
"choices": [
"index": 0,
"message": {
"role": "assistant",
"content": "Founded in 1921, Seattle is the mother city of Pacific Northwest. Seattle is the densely populated second-largest city in the state of Washington along with Portland. A small city at heart, Seattle has transformed itself from a small manufacturing town to the contemporary Pacific Northwest hub to its east. The city's charm and frequent unpredictability draw tourists and residents alike. Here are my favorite things about Seattle.\n* Seattle has a low crime rate and high quality of life.\n* Seattle has rich history which included the building of the first Pacific Northwest harbor and the development of the Puget Sound irrigation system. Seattle is also home to legendary firm Boeing.\n",
"function_call": null
"delta": null,
"finish_reason": "length"
"created": 5399,
"model": "meta-llama-3-8b-instruct",
"object": "chat.completion",
"system_fingerprint": null,
"usage": {
"completion_tokens": 150,
"prompt_tokens": 571,
"total_tokens": 721

Note that billing is based upon “prompt tokens” and “completion tokens” above. View prices on our pricing page.

Streaming Responses

The following snippet shows you how to obtain the model’s response incrementally as it is generated using streaming (using stream=True).

1from octoai.client import OctoAI
2from octoai.text_gen import ChatMessage
4client = OctoAI()
5for completion in client.text_gen.create_chat_completion_stream(
6 model="meta-llama-3-8b-instruct",
7 messages=[
8 ChatMessage(
9 role="system",
10 content="Below is an instruction that describes a task. Write a response that appropriately completes the request.",
11 ),
12 ChatMessage(role="user", content="Write a blog about Seattle"),
13 ],
14 max_tokens=150,
16 print(completion.choices[0].delta.content, end='', flush=True)

When using streaming mode, the response is of type Iterable[ChatCompletionChunk]. To read each incremental response from the model, you can use a for loop over the returned object. The example above prints each incremental response as it arrives, and they accumulate to form the entire response in the output as the model prediction progresses.

Additional Parameters

To learn about the additional parameters supported by the OctoAI().text_gen.create_chat_completion() method.