Data is a company's most prized possession, but its value lies in your ability to extract helpful insights. Typically, data engineers and analysts hand craft SQL queries to derive key insights, a process that requires expertise and time, often taking days for simple business questions.
Large Language Models (LLMs) can revolutionize this process by instantly translating natural language queries into SQL, making it easier to get an actionable understanding of your business from your data.
We introduce a Natural Language Query (NLQ) engine that performs natural language to SQL++ translation with simple yet effective means: an open-source Llama 3.1 model running on OctoAI, some clever prompt engineering, and a Couchbase NoSQL database.
You’ll learn in this article:
How to use LLMs to generate a JSON description of the collections stored in your NoSQL database.
How to build a Natural Language Query (NLQ) engine that uses that JSON information and a self-correcting LLM inference loop to generate valid SQL queries.
How to build and deploy a simple “travel assistant” chatbot that uses the NLQ engine to retrieve context from a NoSQL Couchbase database and produce a helpful answer for its end user.
To motivate this NLQ engine, let’s look at an example that you’ll be able to run and extend. We start with a NoSQL database that contains information about hotels. Your goal is to power a travel agent chatbot that can assist a customer by retrieving relevant context from your database.
If the user query is:
What are the top five best rated hotels in Paris that offer free breakfast and are pet friendly?
The NLQ engine produces the following SQL++ query from the natural language question above:
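As an illustration, a query in the spirit of what the engine generates might look like the one below (field names such as `free_breakfast`, `pets_ok`, and the `reviews` ratings come from the travel-sample hotel documents; the exact output varies between LLM runs):

```sql
SELECT h.name, h.address, AVG(r.ratings.Overall) AS avg_rating
FROM `travel-sample`.inventory.hotel AS h
UNNEST h.reviews AS r
WHERE h.city = "Paris"
  AND h.free_breakfast = TRUE
  AND h.pets_ok = TRUE
GROUP BY h.name, h.address
ORDER BY avg_rating DESC
LIMIT 5;
```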
Running this query against the database returns a dataframe containing the relevant hotel information that meets the customer’s criteria, ranked by rating. That dataframe can then be passed to an LLM, which produces the final answer for the customer.
By the end of this article, you’ll know how to build an NLQ engine and leverage it to power a travel agent chatbot, as shown below.
Let’s go over the steps needed to deliver this outcome.
1. Couchbase, NoSQL database for the enterprise
For this walk-through, we’ll specifically generate SQL++ queries to execute on Couchbase databases.
Couchbase is a high-performance NoSQL database that combines key-value storage with advanced querying, analytics, and full-text and vector search. It’s designed for real-time applications, offering flexible JSON data models, easy scalability, high availability, and low latency.
We can query Couchbase using a SQL-like dialect called SQL++, which looks like this:
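As a minimal illustration (assuming the standard travel-sample bucket, its `inventory` scope, and the `airline` collection):

```sql
SELECT a.name, a.callsign
FROM `travel-sample`.inventory.airline AS a
WHERE a.country = "United States"
LIMIT 5;
```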
In our search for tutorials and code for LLM-powered SQL generation, we found plenty of content on generic SQL generation, but it was harder to find content adapted to the SQL++ dialect or targeting Couchbase specifically.
It’s worth noting that Couchbase offers Capella iQ, an LLM-powered coding assistant built into the Capella Database-as-a-Service and available as an extension in popular IDEs; here, however, we want to show how you can DIY such a service and extend it to fit your needs.
You can sign up for a free Capella database to use a fully managed database, pre-loaded with the travel dataset that we’ll be using in this tutorial.
2. Overview of sample data
In this tutorial, we’ll use the Travel Sample data model, which you can explore in depth in Couchbase’s documentation.
This data model comprises five main collections:
Airlines
Airports
Hotels
Routes
Landmarks
Here’s a sample of the documents in the Hotel collection:
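A trimmed hotel document from travel-sample looks roughly like this (fields abridged; see the dataset for the full document):

```json
{
  "id": 10025,
  "type": "hotel",
  "name": "Medway Youth Hostel",
  "city": "Medway",
  "country": "United Kingdom",
  "free_breakfast": true,
  "free_internet": false,
  "pets_ok": true,
  "vacancy": true,
  "reviews": [
    { "author": "…", "content": "…", "ratings": { "Overall": 4 } }
  ]
}
```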
3. Installation of packages and dependencies
Our next step will be to set up our development environment. All of the files required to replicate this natural language query engine are available on GitHub.
We’ll use `virtualenv` to create the dev environment for our project. Execute the following commands to create it and install the dependencies:
Next, create a `.env` file with the following variables:
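The variable names below come from the project; the values are placeholders you’ll replace with your own credentials:

```
COUCHBASE_URL=couchbases://your-cluster.cloud.couchbase.com
COUCHBASE_USERNAME=your-username
COUCHBASE_PASSWORD=your-password
OCTOAI_API_TOKEN=your-octoai-token
```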
The `COUCHBASE_PASSWORD`, `COUCHBASE_URL`, and `COUCHBASE_USERNAME` values correspond to the sample database you can create once you follow Couchbase’s onboarding process.
To get an `OCTOAI_API_TOKEN`, follow these instructions to get started with a $10 credit for tokens.
Let’s have a quick overview of the files used in this project:
In `config/constants.py` (link) we store the following values related to Couchbase:

In `resources/prompts` (link) we store the prompts used in the different stages of the NLQ engine pipeline. We’ll review them later on.

In `services` (link) we store convenience classes that abstract away our connection to Couchbase and the NLQ engine itself. In `services/db_client.py` (link) we set up the connection to Couchbase in the constructor. The `get_sample()` method (link) simply returns a limited number of results from a given collection; under the hood it uses `run_raw_query()` (link), which executes SQL++ queries directly against the bucket and scope of interest.
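As a rough idea, `config/constants.py` might hold values like these (the names and values here are illustrative guesses based on the travel-sample layout; see the repository for the real ones):

```python
# config/constants.py (illustrative sketch; actual names/values live in the repo)
CB_BUCKET = "travel-sample"      # bucket pre-loaded with the sample data
CB_SCOPE = "inventory"           # scope holding the five collections
CB_COLLECTIONS = ["airline", "airport", "hotel", "route", "landmark"]
SAMPLE_SIZE = 20                 # documents sampled per collection for schema generation
```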
4. Building JSON schemas for each collection
In order to generate proper SQL++ queries, we’ll need to feed a description of what’s inside each collection to the LLM.
In a traditional relational database, this would be given by the DDL statements, particularly the `CREATE TABLE` instructions. However, since Couchbase is a NoSQL database with a dynamic schema, there is no fixed definition of the elements in a collection.
The solution is to take a sample of documents, and use them to come up with a JSON Schema that roughly describes the contents of the collections.
We could do this manually, but it’s far more productive to have an LLM do it for us. Here’s the prompt we used (link):
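The exact prompt lives in the repository; a schema-generation prompt in this spirit might read (the placeholders are assumptions):

```text
You are a database expert. Below is a sample of JSON documents from the
"{collection}" collection. Produce a JSON Schema that describes the structure
of these documents: field names, types, and a one-line description of each
field. Respond with the JSON Schema only.

Documents:
{documents}
```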
Now, the magic happens when we pass this to Llama 3.1. For this we’ll use OctoAI and LangChain.
The `CouchbaseNLQEngine` class, which lives in `services/nlq_engine.py` (link), encapsulates this process in the `generate_db_schema()` method (link).
Let’s take a look at how we instantiate the `OctoAIEndpoint` client and the `ChatOctoAI` client (link), both connected to our hosted Llama 3.1 LLM.
Next, let’s take a look at `generate_db_schema()` (link):
Essentially, we take a sample of 20 documents from each collection and send them to Llama 3.1, asking it to do the heavy lifting (creating the JSON schema) for us.
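The sampling loop just described can be sketched as follows (a simplified stand-in for the real method; `get_sample` and `ask_llm` are hypothetical callables standing in for the database client and the LangChain/OctoAI call):

```python
import json

def generate_db_schema(collections, get_sample, ask_llm, sample_size=20):
    """For each collection, sample documents and ask the LLM for a JSON Schema."""
    schemas = {}
    for name in collections:
        docs = get_sample(name, sample_size)   # e.g. the DB client's get_sample()
        prompt = (
            f"Produce a JSON Schema describing these documents from '{name}':\n"
            + json.dumps(docs, indent=2)
        )
        schemas[name] = ask_llm(prompt)        # e.g. a hosted Llama 3.1 call
    return schemas
```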
The result can be found in `resources/collection_schemas.json` (link) and provides a comprehensive description of the collection schemas, all generated by an LLM. Neat, right?
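A fragment of such a file might look something like the following (field list abridged, and the actual output varies from run to run):

```json
{
  "hotel": {
    "type": "object",
    "properties": {
      "name": { "type": "string", "description": "Name of the hotel" },
      "city": { "type": "string", "description": "City where the hotel is located" },
      "free_breakfast": { "type": "boolean", "description": "Whether breakfast is free" },
      "pets_ok": { "type": "boolean", "description": "Whether pets are allowed" }
    }
  }
}
```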
5. NLQEngine: Translating natural language queries into SQL++
We have arrived at the most interesting part of this article!
With the collection schemas we generated in the last step, we now have nearly all the pieces we need to go from natural language to SQL++.
The last piece we must add is a proofreading step to ensure quality output. As we're automating the effort of a data scientist or analyst, we need to instruct the LLM to do the kind of quality assurance work typically coded by hand.
This proofreading step incorporates some domain-specific information, such as the fact that country names need a specific format (e.g., use “United States” rather than “USA”). Performing some Exploratory Data Analysis (EDA) beforehand is good practice: it helps nudge the LLM in the right direction, whatever the particulars of the task.
Here’s the prompt we use to perform proofreading (link):
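The actual prompt is linked above; a proofreading prompt in this spirit might read (the wording below is a sketch, not the repository’s prompt):

```text
Rewrite the user's question into clear, grammatical English. Fix typos and
expand abbreviations. Apply these domain rules:
- Use full country names ("United States", not "USA" or "US").
- Replace vague regions with concrete countries present in the dataset
  (e.g. "Europe" -> "United Kingdom" and "France").
Return only the rewritten question.

Question: {query}
```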
The `proofread_query()` method executes this prompt against Llama 3.1 (link).
The next step is to use our generated JSON schemas, along with the proofread query, to (hopefully) produce a syntactically correct SQL++ query. We achieve this by engineering the following prompt (link):
To run the query, we implement the `run_query()` method (link) in `CouchbaseNLQEngine`. It receives the collection schemas, the query as a string, and the maximum number of times we’ll try to produce a syntactically correct query.
Retries here are crucial, because the LLM sometimes makes errors in the SQL++ it generates. The best way to help it correct itself is to send it the exception trace obtained from running the query and try again.
Behold, the entire NLQ engine, summarized in the steps below:
We begin by proofreading the input query.
Then we set up our system message, along with the user message, which is the query generation prompt with the proper context filled in.
Next, we invoke the LLM to generate the query by calling the `chat_llm` instance we created in the constructor.
Finally, we run the raw query produced by the LLM against Couchbase. If we get an exception, we append both the generated query and the error message to the conversation and try again.
Once we get a syntactically correct query that runs without hiccups, we return the result. If we reach the maximum number of retries, we give up and raise an exception.
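The steps above can be sketched as a self-correcting loop. This is a simplified stand-in for the real `run_query()` method: `chat_llm` and `run_raw_query` here are hypothetical callables standing in for the LangChain client and the database call:

```python
def run_query(chat_llm, run_raw_query, gen_prompt, max_retries=3):
    """Generate SQL++ with the LLM; on failure, feed the error back and retry."""
    messages = [("system", "You translate natural language into SQL++."),
                ("user", gen_prompt)]
    for _ in range(max_retries):
        sql = chat_llm(messages)              # LLM proposes a SQL++ query
        try:
            return sql, run_raw_query(sql)    # success: return query + results
        except Exception as err:              # e.g. a SQL++ syntax error
            messages.append(("user",
                f"This query failed:\n{sql}\nError: {err}\nPlease fix it."))
    raise RuntimeError("Could not produce a valid SQL++ query")
```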
6. Putting everything together
We can knit everything together into:
A simple Gradio interface to the NLQ engine (take a look at `main.py` - link), and
A travel agent chatbot that uses the NLQ engine to generate a helpful answer for an end user (take a look at `travel_agent.py` - link).
Instructions to set up these apps are available on the project README.
Gradio interface to the NLQ engine
You can launch the Gradio interface script by executing the following command:
First, we instantiate our database client as well as our NLQ engine. We try to find a previously generated version of the collection schemas, and if we fail, we create and save a new one.
Then we launch a Gradio user interface that takes a natural language query and returns the JSON documents that satisfy it.
Navigate to http://127.0.0.1:7860 and start interacting with the app.
Example 1
Query: Give me 3 hotels in France.
Under the hood, this is what happened:
→ Query: Give me 3 hotels in France
→ Proofread query: Give me 3 hotels in France.
→ Generated SQL++ query:
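A plausible generated query, assuming the standard travel-sample layout (the exact output varies between runs), would be:

```sql
SELECT h.name, h.city, h.country
FROM `travel-sample`.inventory.hotel AS h
WHERE h.country = "France"
LIMIT 3;
```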
So far so good. Let’s try something a bit more challenging:
Example 2
Query: I wanna fly to europ. what airports r there? give me thre optionos
Notice the typos which are intentional. Here are the results:
→ Query: I wanna fly to europ. what airports r there? give me thre optionos
→ Proofread query: I want to fly to the United Kingdom and France. What airports are there? Give me three options.
→ Generated SQL++ query:
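A plausible generated query for the proofread question, again assuming the standard travel-sample layout, would be:

```sql
SELECT a.airportname, a.city, a.country
FROM `travel-sample`.inventory.airport AS a
WHERE a.country IN ["United Kingdom", "France"]
LIMIT 3;
```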
Travel Agent Chatbot Streamlit App
Now that we understand the inner workings of the NLQ engine, let’s use it to power a travel agent chatbot. We’re going to take the concept of “natural language” to the extreme by translating voice into SQL++ queries, using a Whisper model for voice-to-text transcription.
Let’s take a look at the high level architecture of this app:
A user sends their voice command to the app, e.g. “What are the top five best rated hotels in Paris that offer free breakfast and are pet friendly?”
Whisper translates the voice command to text.
We send the text to the NLQ engine, which translates the natural language query into SQL++ that can run on Couchbase.
We execute the SQL++ query in Couchbase and get a dataframe from it, which we pass as context to an LLM along with the original user question to produce a helpful answer.
Run the streamlit app by executing the following:
See the app screenshot which shows all of the intermediate steps in producing a helpful answer back to the user:
Conclusion
In this tutorial, we learned how to harness the power of state-of-the-art LLMs to translate Natural Language Queries into syntactically correct SQL++ queries, which can be executed directly against a Couchbase database instance.
By combining prompt engineering, OctoAI’s easy-to-use SDK, and Couchbase’s high-performance capabilities, we built a simple yet powerful abstraction that allows us to explore any dataset, regardless of size or structure, in plain English.
SQL generation from natural language is a hard problem. This tutorial provides a starting point for building powerful solutions, and there is plenty of room for improvement; our aim is to give your team a solid foundation to build on. We plan to release more content that introduces optimizations and mechanisms to iterate on and improve the NLQ engine concept.