Using Unstructured.io for embedding documents

Fast and easy document parsing and embedding using Unstructured.io and OctoAI.

Introduction

Unstructured is both an open-source library and an API service. The library provides components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and many more.

It also provides components for embedding these documents. In Unstructured’s terminology, such a component is called an EmbeddingEncoder. An OctoAIEmbeddingEncoder is available, so documents parsed with Unstructured can be embedded with the OctoAI embeddings endpoint.

Using the OctoAIEmbeddingEncoder

The OctoAIEmbeddingEncoder class connects to the OctoAI Text&Embedding API to obtain embeddings for pieces of text.

embed_documents receives a list of Elements and returns an updated list in which each Element carries an embeddings attribute.

embed_query receives a query string and returns its embedding vector as a list of floats.

num_of_dimensions is a metadata property that denotes the number of dimensions in any embedding vector obtained via this class.

is_unit_vector is a metadata property that denotes if embedding vectors obtained via this class are unit vectors.
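
To make these two metadata properties concrete, the sketch below shows what they describe for a plain list of floats. The helper names dimensions and is_unit, the tolerance, and the sample vector are hypothetical illustrations, not part of Unstructured or OctoAI.

```python
import math


def dimensions(vector):
    """Number of dimensions: simply the length of the embedding vector."""
    return len(vector)


def is_unit(vector, tol=1e-6):
    """True if the vector's Euclidean (L2) norm is 1 within a small tolerance."""
    norm = math.sqrt(sum(x * x for x in vector))
    return math.isclose(norm, 1.0, abs_tol=tol)


# Hypothetical 3-dimensional vector standing in for a real embedding.
sample = [0.6, 0.8, 0.0]
print(dimensions(sample), is_unit(sample))
```

Knowing that vectors are unit length matters downstream: for unit vectors, the dot product of two embeddings equals their cosine similarity, so no extra normalization step is needed.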

The following code block shows an example of how to use OctoAIEmbeddingEncoder. It prints the updated elements list (with the embeddings attribute included for each element), the embedding vector for the query string, and some metadata properties of the embedding model. To run this example, set an environment variable named OCTOAI_API_KEY. To obtain an API key, see: How to create an OctoAI API token.

import os

from unstructured.documents.elements import Text
from unstructured.embed.octoai import OctoAiEmbeddingConfig, OctoAIEmbeddingEncoder

# Read the API key from the OCTOAI_API_KEY environment variable.
embedding_encoder = OctoAIEmbeddingEncoder(
    config=OctoAiEmbeddingConfig(api_key=os.environ["OCTOAI_API_KEY"])
)

# Embed a list of elements; each returned element carries its embedding.
elements = embedding_encoder.embed_documents(
    elements=[Text("This is sentence 1"), Text("This is sentence 2")],
)

# Embed a single query string.
query = "This is the query"
query_embedding = embedding_encoder.embed_query(query=query)

for e in elements:
    print(e.embeddings, e)
print(query_embedding, query)
print(embedding_encoder.is_unit_vector(), embedding_encoder.num_of_dimensions())
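
A common next step is to rank the embedded elements by how similar they are to the query. The following is a minimal sketch of that idea, not part of the Unstructured API: cosine_similarity and rank_by_similarity are hypothetical helpers, and the two-dimensional toy vectors stand in for the real e.embeddings and query_embedding values so that the snippet runs without an API key.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vector, items):
    """Return (label, score) pairs sorted from most to least similar to the query.

    items is a list of (label, vector) pairs.
    """
    scored = [(label, cosine_similarity(query_vector, vec)) for label, vec in items]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors (assumed values) in place of real embeddings.
toy_query = [1.0, 0.0]
toy_items = [("sentence 1", [0.0, 1.0]), ("sentence 2", [0.8, 0.6])]

ranking = rank_by_similarity(toy_query, toy_items)
for label, score in ranking:
    print(label, round(score, 3))
```

With real data, the pairs could be built from the example above, for instance as (str(e), e.embeddings) for each element. If is_unit_vector reports that the vectors are already unit length, the division by the norms is redundant and a plain dot product gives the same ranking.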