Using Unstructured.io for embedding documents
Fast and easy document parsing and embedding using Unstructured.io and OctoAI.
Introduction
Unstructured is both an open-source library and an API service. The library provides components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and many more.
It also provides components for embedding these documents. In Unstructured’s terminology, such a component is called an EmbeddingEncoder. An OctoAIEmbeddingEncoder is available, so documents parsed with Unstructured can be embedded through the OctoAI embeddings endpoint.
Using the OctoAIEmbeddingEncoder
The OctoAIEmbeddingEncoder class connects to the OctoAI Text&Embedding API to obtain embeddings for pieces of text.
embed_documents receives a list of Elements and returns an updated list in which each Element carries an embeddings attribute.
embed_query receives a query as a string and returns a list of floats, which is the embedding vector for that query string.
num_of_dimensions is a metadata property that gives the number of dimensions of any embedding vector obtained via this class.
is_unit_vector is a metadata property that indicates whether the embedding vectors obtained via this class are unit vectors.
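To make the contract of these four members concrete without calling any service, here is a minimal, library-free sketch. ToyElement and ToyEmbeddingEncoder are hypothetical stand-ins, not part of Unstructured or OctoAI, and the hash-based vectors carry no semantic meaning. Exposing the two metadata members as Python properties, and returning a plain integer from num_of_dimensions, are assumptions made for illustration; the real class may differ.

```python
import hashlib
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyElement:
    """Hypothetical stand-in for an Unstructured Element."""
    text: str
    embeddings: Optional[List[float]] = None


class ToyEmbeddingEncoder:
    """Hypothetical encoder mirroring the four members described above."""

    def __init__(self, dimensions: int = 8) -> None:
        # A SHA-256 digest has 32 bytes, which caps this toy at 32 dimensions.
        if not 1 <= dimensions <= 32:
            raise ValueError("dimensions must be between 1 and 32")
        self._dimensions = dimensions

    def _embed(self, text: str) -> List[float]:
        # Deterministic pseudo-features from a hash, then L2-normalized.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        raw = [digest[i] - 127.5 for i in range(self._dimensions)]
        norm = math.sqrt(sum(x * x for x in raw))
        return [x / norm for x in raw]

    def embed_documents(self, elements: List[ToyElement]) -> List[ToyElement]:
        # Attach an embeddings attribute to each element and return the list.
        for element in elements:
            element.embeddings = self._embed(element.text)
        return elements

    def embed_query(self, query: str) -> List[float]:
        # A query is a plain string; the result is a list of floats.
        return self._embed(query)

    @property
    def num_of_dimensions(self) -> int:
        return self._dimensions

    @property
    def is_unit_vector(self) -> bool:
        # True because _embed always L2-normalizes its output.
        return True


encoder = ToyEmbeddingEncoder(dimensions=8)
elements = encoder.embed_documents(
    [ToyElement("This is sentence 1"), ToyElement("This is sentence 2")]
)
query_embedding = encoder.embed_query("This is the query")

for element in elements:
    print(element.text, len(element.embeddings))
print(len(query_embedding), encoder.num_of_dimensions, encoder.is_unit_vector)
```

When is_unit_vector holds, the dot product of two embeddings equals their cosine similarity, which is why a vector store may want to know this property up front.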
The following code block shows an example of how to use OctoAIEmbeddingEncoder.
You will see the updated elements list (with the embeddings attribute included for each element), the embedding vector for the query string, and some metadata properties about the embedding model.
You will need to set an environment variable named OCTOAI_API_KEY to be able to run this example.
To obtain an API key, see "How to create an OctoAI API token".
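Because a missing key typically surfaces only as an authentication error at request time, it can help to check the environment variable up front. The following is a small sketch using only the Python standard library; the helper name load_octoai_api_key is hypothetical, and only the variable name OCTOAI_API_KEY comes from the text above.

```python
import os
from typing import Mapping, Optional


def load_octoai_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the OctoAI API key from the environment, or fail with a clear message.

    Hypothetical helper: `env` defaults to os.environ and exists so the
    lookup can be exercised with a plain dict.
    """
    source = os.environ if env is None else env
    key = source.get("OCTOAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OCTOAI_API_KEY is not set; export it before running the example."
        )
    return key


# Demonstration with an explicit mapping, so no real key is needed here.
print(load_octoai_api_key({"OCTOAI_API_KEY": "example-token"}))
```

In a real script you would call load_octoai_api_key() with no arguments and pass the result to whatever configuration object the encoder expects.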