OctoStack

Run GenAI in your environment

OctoStack is a turnkey production GenAI serving stack that delivers highly-optimized inference at enterprise scale.

Request a Demo

Efficient, reliable, self-contained GenAI deployment

OctoStack allows you to run your choice of models in your environment, including any cloud platform, VPC, or on-premise, ensuring full control over your data. This solution encompasses state-of-the-art model serving technology meticulously optimized at every layer, from data input to GPU code.

Overview diagram of how OctoStack by OctoAI would work in your infrastructure environment

GenAI serving stack

OctoAI removes the complications of running, managing, and scaling GenAI systems, so you can focus on developing your AI apps and projects.

OctoAI is built upon systems and compilation technologies we launched: XGBoost, Apache TVM, and MLC/MLC-LLM, providing you an enterprise system running in your private environment.

Diagram overview of OctoAI systems showing APIs, solutions, SOC 2 Type II certification, and support for hardware and environments

OctoStack delivers on our performance- and security-sensitive use cases. It lets us easily and efficiently run the customized models we need within the environments we choose and supports the scale our customers require.

Dali Kaafar portrait

Dali Kaafar

CEO Apate AI

Apate AI logo

OctoStack's 10x performance boost

We built optimization technologies into the OctoAI systems stack, and these are available for builders to turn on or off, as needed, for their deployments. Benchmarking shows 4x to 5x improvements in low concurrency use cases, and up to 10x or more for larger scale deployments with tens to hundreds of concurrent users.

See how
Multi-user Throughput of vLLM compared to OctoStack chart

See 4x GPU utilization improvements

Maximize the effectiveness of your GPUs when you combine them with OctoStack’s optimized serving layer. Instantly reduce costs and latency compared to proprietary model providers and DIY deployment methods.

Get the most from your data with OctoStack and Snowflake

Generative AI is making it easier for companies to derive more value from their data by using LLMs in their secure environment. Using Retrieval-Augmented Generation (RAG) with OctoStack in Snowflake’s Snowpark expands the use of your own datasets, so users can conversationally ask questions and improve business outcomes.

See how
Diagram showing how RAG can work using OctoStack in your environment to enrich your existing data and how users interact with that data

Securely and confidently run GenAI at scale

Benefit from OctoAI's expertise in inference optimization while meeting privacy and compliance requirements. Scale enterprise applications with reliable performance.

Learn more about OctoStack

Review these resources to get an in-depth understanding of how OctoStack can expedite running GenAI in your environment — privately and securely.


OctoStack Info Brief

Download the OctoStack Info Brief for a quick overview you can easily share with colleagues.

GenAI in your environment

Watch the webinar to get an understanding of how OctoStack helps you overcome the complexities of implementing GenAI in your stack.

Bring GenAI to your Datastore

Watch the webinar to learn how to build data workflows on your Snowflake data with OctoStack, leverage RAG, and use GenAI in your data pipeline to enrich your business data.

Frequently asked questions

Don’t see the answer to your question here? Feel free to reach out so we can help.

How much does OctoStack cost?
Reach out to us to talk about the details of your environment and requirements.
What level of security and reliability are available with OctoStack?
OctoAI is SOC 2 Type II certified, and OctoStack runs in your environment, giving you full control of your data and pipelines.
What GPU hardware should we run for language models?

Smaller models (7B and 8B) can run on NVIDIA A10G GPUs, while the minimum recommendation for larger models (70B or 8x7B) is two A100 or H100 GPUs. These GPUs tend to provide the best throughput and latency. If you are unsure about your hardware capabilities, please contact us, and our experts will work to understand your requirements and current resources.
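The sizing guidance above follows from simple arithmetic on weight memory. A rough back-of-the-envelope sketch (it deliberately ignores KV cache, activations, and quantization, all of which change the real numbers):

```python
# Rough VRAM sizing for serving an LLM in fp16 (illustrative only; real
# deployments also need memory for KV cache, activations, and runtime overhead).
def weight_memory_gb(params_billions, bytes_per_param=2):
    """Memory for model weights alone, assuming fp16 (2 bytes per parameter)."""
    return params_billions * 1e9 * bytes_per_param / 1e9

# A 7B model needs ~14 GB of weights, so it fits a single 24 GB A10G.
print(weight_memory_gb(7))   # 14.0

# A 70B model needs ~140 GB, which exceeds one 80 GB A100/H100 --
# hence the two-GPU minimum recommendation.
print(weight_memory_gb(70))  # 140.0
```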

Can I run OctoStack in disconnected (airgapped) mode?

Yes, OctoStack is designed to support deployment within customer environments, including environments with no connectivity to the Internet.

How does auto-scaling work in OctoStack?

OctoStack emits metrics that let you decide when to scale GPUs up and down. These metrics include the number of pending and in-flight requests, along with requests per second.
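A scaling policy built on those metrics can be as simple as comparing queue depth to capacity. The sketch below is a hypothetical illustration, not OctoStack's actual autoscaler; the capacity and threshold values are assumptions you would tune for your workload:

```python
def desired_replicas(current, pending, in_flight, per_gpu_capacity=16,
                     scale_up_threshold=0.8, scale_down_threshold=0.3):
    """Choose a GPU replica count from pending + in-flight request metrics.

    Thresholds and per-GPU capacity are illustrative placeholders.
    """
    load = (pending + in_flight) / (current * per_gpu_capacity)
    if load > scale_up_threshold:
        return current + 1            # queue is building up: add a replica
    if load < scale_down_threshold and current > 1:
        return current - 1            # mostly idle: release a replica
    return current                    # within the comfortable band

# 30 outstanding requests across 2 replicas (~94% of capacity) -> scale up.
print(desired_replicas(current=2, pending=10, in_flight=20))  # 3
```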

How does load balancing work in OctoStack?

OctoStack includes an advanced load balancing solution that increases GPU throughput and utilization. Our system is designed for generative AI workloads and can manage both homogeneous and heterogeneous workload profiles.
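One common policy for GPU-serving workloads is to route each request to the replica with the fewest outstanding requests. This is a simplified stand-in, not OctoStack's actual algorithm:

```python
def pick_replica(outstanding):
    """Least-outstanding-requests routing: send the next request to the
    replica with the fewest in-flight requests (a simplified policy sketch)."""
    return min(outstanding, key=outstanding.get)

# Hypothetical in-flight request counts per GPU replica.
loads = {"gpu-0": 7, "gpu-1": 2, "gpu-2": 5}
print(pick_replica(loads))  # gpu-1
```

For generation workloads, real balancers typically weight by token throughput rather than raw request counts, since request cost varies widely with prompt and output length.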

Do I have access to OctoAI features like JSON-mode on OctoStack?

Yes, OctoStack runs the same OctoAI serving stack as our SaaS API endpoint, and has the same capabilities available.
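For example, a JSON-mode request can follow the OpenAI-compatible chat completions convention. The payload below is illustrative: the model name is a placeholder, and you would POST it to your own OctoStack endpoint:

```python
import json

# Illustrative chat-completions payload with JSON mode enabled.
# The model name is a placeholder; substitute one deployed in your stack.
payload = {
    "model": "example-llama-3-8b-instruct",
    "messages": [
        {"role": "system", "content": "Extract the order fields as JSON."},
        {"role": "user", "content": "Order #123: 2 widgets for Alice."},
    ],
    "response_format": {"type": "json_object"},  # constrain output to JSON
}

body = json.dumps(payload)
print(json.loads(body)["response_format"]["type"])  # json_object
```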

What options are available for using embeddings or building retrieval augmented generation (RAG) workflows?

We have integrations and partnerships with industry leading services, including LangChain, Unstructured, LlamaIndex, Pinecone, and others. All of these integrations are available in OctoStack.
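At its core, a RAG workflow embeds your documents, retrieves the most similar ones for a query, and feeds them to the LLM as context. A dependency-free sketch of the retrieval step, with toy vectors standing in for real embedding-model output:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy document embeddings; in practice an embedding model served by the
# stack (or a vector store like Pinecone) produces and indexes these.
docs = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.8, 0.2],
}

query = [0.85, 0.15, 0.05]  # embedding of "How do I get my money back?"
best = max(docs, key=lambda name: cosine(query, docs[name]))
print(best)  # refund policy
```

The retrieved text is then prepended to the user's question as context in the LLM prompt; frameworks like LangChain and LlamaIndex wrap this loop, and Unstructured handles parsing source documents into chunks.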

GenAI in your environment with optimized performance

Run your choice of models while controlling your data and utilizing OctoAI's world-class inference optimization.

Request a Demo Today