Unstructured.io Integration

The OctoAIEmbedingEncoder is available, so documents parsed with Unstructured can easily be embedded with the OctoAI embeddings endpoint.

Introduction

Unstructured is both an open-source library and an API service. The library provides components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and many more.

It also provides components to very easily embed these documents. In Unstructured’s jargon this component is called an EmbeddingEncoder. The OctoAIEmbedingEncoder is available, so documents parsed with Unstructured can easily be embedded with the OctoAI embeddings endpoint.

Using the OctoAIEmbeddingEncoder

Before you get started let’s review some concepts. You will need an OctoAI API Token for this integration.

  • The OctoAIEmbeddingEncoder class connects to the OctoAI Text&Embedding API to obtain embeddings for pieces of text.
  • embed_documents will receive a list of Elements, and return an updated list which includes the embeddings attribute for each Element.
  • embed_query will receive a query as a string, and return a list of floats which is the embedding vector for the given query string.
  • num_of_dimensions is a metadata property that denotes the number of dimensions in any embedding vector obtained via this class.
  • is_unit_vector is a metadata property that denotes if embedding vectors obtained via this class are unit vectors.

Now, let’s get started with the following code example for how to use OctoAIEmbeddingEncoder. You will see the updated elements list (with the embeddings attribute included for each element), the embedding vector for the query string, and some metadata properties about the embedding model. You will need to set an environment variable named OCTOAI_API_KEY to be able to run this example.

1import os
2
3from unstructured.documents.elements import Text
4from unstructured.embed.octoai import OctoAiEmbeddingConfig, OctoAIEmbeddingEncoder
5
6embedding_encoder = OctoAIEmbeddingEncoder(
7 config=OctoAiEmbeddingConfig(api_key=os.environ["OCTOAI_API_KEY"])
8)
9elements = embedding_encoder.embed_documents(
10 elements=[Text("This is sentence 1"), Text("This is sentence 2")],
11)
12
13query = "This is the query"
14query_embedding = embedding_encoder.embed_query(query=query)
15
16[print(e.embeddings, e) for e in elements]
17print(query_embedding, query)
18print(embedding_encoder.is_unit_vector(), embedding_encoder.num_of_dimensions())

Learn more about our partnership: