Introduction
The ability to extract valuable insights from text is paramount for businesses and researchers. This journey begins with understanding the power of embeddings, a crucial technique in Artificial Intelligence that transforms text into a format readily interpretable by machines. This blog post will equip you with a foundational understanding of text embeddings using two platforms:
OctoAI delivers a complete stack for app builders to run, tune, and scale their AI applications in either the cloud or on-prem. Their Text Generation solution hosts highly scalable and optimized LLMs and embedding models.
Unstructured.io is a powerful data management solution designed to handle unstructured data, which typically poses challenges for traditional data processing systems. By effectively managing and extracting value from unstructured data, Unstructured enables organizations to tap into a wealth of untapped insights, enhancing their decision-making capabilities.
Together, these tools provide a comprehensive solution for data analysis challenges, enabling users to derive meaningful insights from vast amounts of information. We will explore how integrating OctoAI's GTE-Large embedding model with Unstructured's embedding functionality can enhance data processing and Retrieval-Augmented Generation (RAG) performance.
Understanding the OctoAI text embedding model
Embeddings in Machine Learning and Natural Language Processing (NLP) refer to the representation of categorical or textual data as numerical vectors. These vectors capture the semantic relationships between words, phrases, or entire documents, allowing machine learning algorithms to process and understand language data more effectively.
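To make that concrete, here is a toy sketch: hand-made three-dimensional vectors stand in for real embeddings, and cosine similarity measures how closely two vectors, and hence two meanings, align:

```python
import numpy as np

# Hand-made toy vectors standing in for real embeddings
cat = np.array([0.9, 0.8, 0.1])
kitten = np.array([0.85, 0.75, 0.2])
car = np.array([0.1, 0.2, 0.9])

def cosine_similarity(a, b):
    # 1.0 means identical direction, values near 0 mean unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(cat, kitten))  # ~0.99: related meanings
print(cosine_similarity(cat, car))     # ~0.30: unrelated meanings
```

Real embedding models do the same thing, only with vectors of hundreds or thousands of dimensions learned from large text corpora.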
OctoAI provides the General Text Embeddings (GTE) Large embedding model. Alibaba DAMO Academy trained the GTE models on a large-scale corpus of relevant text pairs covering a wide range of domains and scenarios. This makes the GTE models suitable for various sophisticated NLP tasks, such as information retrieval, semantic textual similarity, language translation, summarization, and text reranking, all of which require a deep understanding of language structure and meaning.
GTE-Large performs very well on the MTEB leaderboard, with an average score of 63.13% (comparable to OpenAI's text-embedding-ada-002, which scores 61.0%). The model is designed primarily for English text input. Input is limited to a maximum of 512 tokens (any text longer than that will be truncated), and each input produces an embedding vector of 1024 dimensions. This makes OctoAI GTE-Large embeddings a versatile tool for various NLP projects, from chatbots and virtual assistants to content analysis and recommendation systems.
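As a quick sanity check of those numbers, you can call the embeddings endpoint directly. The endpoint URL and model identifier below assume OctoAI's OpenAI-compatible text API; double-check them against the current OctoAI docs:

```python
import os

import requests

# Assumes OctoAI's OpenAI-compatible embeddings endpoint and model identifier;
# verify both against the current OctoAI documentation.
response = requests.post(
    "https://text.octoai.run/v1/embeddings",
    headers={"Authorization": f"Bearer {os.environ['OCTOAI_API_KEY']}"},
    json={"model": "thenlper/gte-large", "input": "A quick sanity check."},
)
embedding = response.json()["data"][0]["embedding"]
print(len(embedding))  # 1024 dimensions
```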
Handling unstructured data
Unstructured.io provides a data ingestion and processing platform. Businesses deal with abundant unstructured data that does not have a predefined format or organization, making it challenging to process and analyze using traditional methods. Examples of unstructured and semi-structured data include text documents, tables, and images in various file formats, such as PDFs, Excel, or PowerPoint.
Unstructured stands out from other data management solutions due to its unique features and capabilities:
Advanced data processing: Uses state-of-the-art algorithms and machine learning models to extract text and information from unstructured data, including optical character recognition (OCR), NLP techniques, and Transformer-based models.
Scalability: Built to handle large volumes of unstructured data via the Unstructured SaaS API and Enterprise Platform, making it suitable for enterprises and organizations with vast information.
Customization: Allows users to define custom data models and extraction rules, ensuring the platform can adapt to specific business needs and use cases.
Integration: Integrates easily with other data management systems, AI platforms, and analytics tools, enabling seamless data flow and facilitating end-to-end data processing pipelines.
What we will build: 3 usable examples
This article demonstrates how to use the OctoAI GTE-Large embeddings model with Unstructured's OctoAI embedding support in a RAG application, together with the Pinecone vector database and a Mistral AI LLM. We will explore their key features, understand how they work, and examine their practical applications through code examples.
We develop three use cases as examples, from basic text embedding using Unstructured and the OctoAI embedding model to a complete RAG application capable of processing a PDF file and performing a RAG search on its content. We begin with a basic script demonstrating the use of OctoAI for embeddings with Unstructured. Then, we enhance this script to process a PDF file and generate the corresponding embeddings. Lastly, we build upon the previous examples by uploading the embeddings to the Pinecone vector database with vector search capabilities and utilizing Mistral AI to execute the RAG functionality.
By the end of this article, you will be able to build a RAG application with the following workflow:
Code walkthrough and prerequisites
You can follow the code below in this Google Colab. Before we start, you will need the following API keys:

- An OctoAI API key
- An Unstructured API key
- A Pinecone API key
- A Mistral AI API key
You can create a `.env` file to store the API keys with the following configuration:
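For example (the variable names below are the ones the snippets in this post read; they are a convention, not a requirement):

```
OCTOAI_API_KEY=<your OctoAI API key>
UNSTRUCTURED_API_KEY=<your Unstructured API key>
PINECONE_API_KEY=<your Pinecone API key>
MISTRAL_API_KEY=<your Mistral AI API key>
```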
Use case #1: Unstructured.io API with OctoAI GTE embeddings model for simple text processing
This code demonstrates how to use the OctoAI embedding engine within the Unstructured library to generate embeddings for text elements:
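The script might look like the following sketch. The exact import path and config class name can differ between unstructured releases, and the two sample sentences are placeholders:

```python
# embeddings_example.py -- a minimal sketch; import paths may vary across
# unstructured releases.
import os

from dotenv import load_dotenv  # loads the .env file created above
from unstructured.documents.elements import Text
from unstructured.embed.octoai import OctoAIEmbeddingEncoder, OctoAiEmbeddingConfig

load_dotenv()


def embeddings_example():
    # Embedding encoder backed by OctoAI's GTE-Large model
    embedding_encoder = OctoAIEmbeddingEncoder(
        config=OctoAiEmbeddingConfig(api_key=os.environ["OCTOAI_API_KEY"])
    )

    # Two placeholder text elements to embed
    elements = [
        Text(text="Embeddings map text to numerical vectors."),
        Text(text="Similar sentences end up with similar vectors."),
    ]

    # Attach an embedding vector to each element and return them
    return embedding_encoder.embed_documents(elements=elements)


if __name__ == "__main__":
    _embedded_elements = embeddings_example()
    [print(element.embeddings, element) for element in _embedded_elements]
```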
Here's a step-by-step description of the code:
1. Import the necessary packages.
2. Define the `embeddings_example` function, which generates embeddings for the given text elements using the OctoAI embedding engine.
3. Inside the `embeddings_example` function:
   - Create an `OctoAIEmbeddingEncoder` instance using the `OctoAiEmbeddingConfig` class and the OctoAI API key retrieved from the application configuration.
   - Define two text elements as examples, stored in the `elements` list.
   - Generate embeddings for the text elements using the `embed_documents` method of the `OctoAIEmbeddingEncoder` instance.
   - Return the embedded elements.
4. Check if the script runs as the main module (`__name__ == "__main__"`). If so, call the `embeddings_example` function and store the result in the `_embedded_elements` variable.
5. Print the embeddings and the corresponding text elements by iterating through the `_embedded_elements` list with a list comprehension and the `print` function. The embeddings are NumPy arrays, while the text elements are represented as `Text` objects.
Use case #2: Process a PDF with Unstructured.io API and OctoAI embeddings model
This code demonstrates how to use the Unstructured library in combination with the OctoAI embedding engine to process a PDF document, extract text elements, and generate embeddings for those elements. The example uses a PDF file from the Unstructured GitHub examples folder at https://github.com/Unstructured-IO/unstructured/tree/main/example-docs:
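A sketch of that script, assuming the older unstructured-client partition API (newer releases wrap the parameters in a request object) and a locally downloaded copy of one of the example PDFs:

```python
# process_pdf.py -- a sketch; client and config class names may vary across versions
import os

from dotenv import load_dotenv
from unstructured.documents.elements import ElementMetadata, Text
from unstructured.embed.octoai import OctoAIEmbeddingEncoder, OctoAiEmbeddingConfig
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared

load_dotenv()


def process_pdf():
    client = UnstructuredClient(api_key_auth=os.environ["UNSTRUCTURED_API_KEY"])

    # Read the sample PDF (downloaded from example-docs) in binary mode
    filename = "layout-parser-paper.pdf"  # any PDF from example-docs works
    with open(filename, "rb") as f:
        files = shared.Files(content=f.read(), file_name=filename)

    # Partition the document into elements via the Unstructured API
    req = shared.PartitionParameters(files=files, strategy="hi_res")
    resp = client.general.partition(req)

    # Keep only elements with text, preserving file name and page number metadata
    elements = []
    for el in resp.elements:
        if el.get("text"):
            elements.append(
                Text(
                    text=el["text"],
                    metadata=ElementMetadata(
                        filename=el["metadata"].get("filename"),
                        page_number=el["metadata"].get("page_number"),
                    ),
                )
            )

    # Embed the text elements with OctoAI's GTE-Large model
    embedding_encoder = OctoAIEmbeddingEncoder(
        config=OctoAiEmbeddingConfig(api_key=os.environ["OCTOAI_API_KEY"])
    )
    embedded_elements = embedding_encoder.embed_documents(elements=elements)
    return embedded_elements, embedding_encoder


if __name__ == "__main__":
    _embedded_elements, _embedding_encoder = process_pdf()
    for element in _embedded_elements:
        print(element.embeddings, element)
```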
Here's a step-by-step description of the code:
1. Import the necessary packages.
2. Define the `process_pdf` function, which processes a PDF document, extracts text elements, and generates embeddings for those elements using the OctoAI embedding engine.
3. Inside the `process_pdf` function:
   - Create an `UnstructuredClient` instance using the Unstructured API key retrieved from the application configuration.
   - Define the PDF file to process and open it in binary mode.
   - Define the partition parameters, specifying the file content, file name, and partitioning strategy.
   - Partition the document using the `partition` method of the `UnstructuredClient` instance.
   - Process the resulting elements, creating `Text` objects for elements containing text and storing them in the `elements` list along with their metadata (file name and page number).
   - Create an `OctoAIEmbeddingEncoder` instance using the `OctoAiEmbeddingConfig` class and the OctoAI API key retrieved from the application configuration.
   - Generate embeddings for the text elements using the `embed_documents` method of the `OctoAIEmbeddingEncoder` instance.
   - Return the embedded elements and the embedding encoder.
4. Check if the script runs as the main module (`__name__ == "__main__"`). If so, call the `process_pdf` function and store the results in the `_embedded_elements` and `_embedding_encoder` variables.
5. Print the embeddings and the corresponding text elements by iterating through the `_embedded_elements` list with a `for` loop and the `print` function. The embeddings are NumPy arrays, while the text elements are represented as `Text` objects.
Use case #3: Build a full RAG application
This code demonstrates how to use Pinecone, Mistral AI, and the previously processed PDF data to create a vector search index, generate contextually relevant responses to user queries, and interact with a large language model.
This example builds on the previous one and assumes you stored the previous code in a `process_pdf.py` file:
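Here is a condensed sketch of that script. The index name, cloud region, and Mistral model choice are illustrative, and the Mistral calls follow the pre-1.0 mistralai client API:

```python
# rag_example.py -- a condensed sketch; index name, region, and model are illustrative
import os
import time

from dotenv import load_dotenv
from mistralai.client import MistralClient
from mistralai.models.chat_completion import ChatMessage
from pinecone import Pinecone, ServerlessSpec

from process_pdf import process_pdf

load_dotenv()
INDEX_NAME = "octoai-rag-demo"  # hypothetical index name


def pinecone_index(embedded_elements):
    pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

    # Create the index if needed; GTE-Large produces 1024-dimensional vectors
    if INDEX_NAME not in pc.list_indexes().names():
        pc.create_index(
            name=INDEX_NAME,
            dimension=1024,
            metric="cosine",
            spec=ServerlessSpec(cloud="aws", region="us-east-1"),
        )

    # Wait for the index to be ready, then connect to it
    while not pc.describe_index(INDEX_NAME).status["ready"]:
        time.sleep(1)
    index = pc.Index(INDEX_NAME)

    # Upsert each element's embedding along with its text as metadata
    index.upsert(
        vectors=[
            {
                "id": str(i),
                "values": list(element.embeddings),
                "metadata": {"text": element.text},
            }
            for i, element in enumerate(embedded_elements)
        ]
    )
    return index


def rag(query, embedding_encoder, index):
    # Embed the query and fetch the most similar chunks from Pinecone
    query_embedding = embedding_encoder.embed_query(query)
    results = index.query(vector=list(query_embedding), top_k=3, include_metadata=True)
    contexts = "\n---\n".join(m.metadata["text"] for m in results.matches)

    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{contexts}\n\nQuestion: {query}"
    )

    # Pre-1.0 mistralai client; newer releases use Mistral(...).chat.complete instead
    client = MistralClient(api_key=os.environ["MISTRAL_API_KEY"])
    response = client.chat(
        model="open-mistral-7b",  # illustrative model choice
        messages=[
            ChatMessage(role="system", content="You are a helpful assistant."),
            ChatMessage(role="user", content=prompt),
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    _embedded_elements, _embedding_encoder = process_pdf()
    _index = pinecone_index(_embedded_elements)
    _query = "What is this paper about?"  # example question
    print(rag(_query, _embedding_encoder, _index))
```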
Here's a step-by-step description of the code:
1. Import the necessary packages and `process_pdf`: the function from the `process_pdf` module (defined in the previous code example) that processes a PDF document, extracts text elements, and generates embeddings for those elements using the OctoAI embedding engine.
2. Define the `pinecone_index` function, which creates a Pinecone index, initializes it, and inserts the embedded elements into it.
3. Inside the `pinecone_index` function:
   - Create a Pinecone client instance using the Pinecone API key retrieved from the application configuration.
   - Create an index with the specified name, dimension, metric, and specification if it doesn't already exist.
   - Wait for the index to be initialized and connect to it.
   - Insert the embedded elements into the index, including their embeddings, text, and metadata.
   - Return the index.
4. Define the `rag` function, which takes a user query, an embedding encoder, and a Pinecone index as input and generates a contextually relevant response using Mistral AI.
5. Inside the `rag` function:
   - Create a query embedding using the embedding encoder.
   - Retrieve relevant contexts from the Pinecone index based on the query embedding.
   - Format the retrieved contexts into a single string.
   - Create a prompt that includes the contexts and the user query.
   - Prepare a chat request for Mistral AI, including the system prompt and the user query.
   - Execute the chat request using the Mistral AI client and retrieve the response.
   - Return the answer from the Mistral AI response.
6. Check if the script runs as the main module (`__name__ == "__main__"`). If so:
   - Call the `process_pdf` function to process the PDF document and generate embedded elements and an embedding encoder.
   - Create a Pinecone index using the `pinecone_index` function, the embedded elements, and the embedding encoder.
   - Define a user query.
   - Call the `rag` function with the user query, embedding encoder, and Pinecone index to generate a contextually relevant response.
   - Print the answer.
Let’s see the output of this RAG example:
Conclusion
The synergy between OctoAI's GTE-Large embeddings model and Unstructured.io unlocks new possibilities and improvements for advanced NLP tasks, including understanding and retrieving complex documents in a RAG application.
To take the first step towards transforming your data analysis processes and unleashing the full potential of your unstructured data, we encourage you to sign up for OctoAI and create an OctoAI API key and an Unstructured API key today. We invite you to join us on Discord to engage with the team and community and to share updates about your new AI-powered applications. We look forward to hearing from you on our channels!