OctoStack overview

OctoAI's GenAI production stack in your environment.


OctoStack is a turnkey generative AI solution allowing you to run your choice of models in your environment. It delivers OctoAI’s industry-leading AI inference service and performance optimizations with the privacy benefits of self-managing within your environment. OctoStack provides a full stack solution for running generative AI at scale - including inference, model customization, load balancing, auto-scaling, and telemetry. OctoStack is compatible with all cloud service providers and container orchestration services. Contact us to get started!

Key Benefits

Privately deploy in any environment

OctoStack is designed for portability across any cloud platform or on-premise data center. Prebuilt containers are easily deployed using Kubernetes or your preferred container orchestration service. Maintain complete data privacy and control by processing all generative AI workloads within your environment.

Performance optimized inference

OctoStack includes OctoAI’s performance optimizations delivered through Apache TVM, a compiler framework that optimizes and accelerates inferences. Optimized inference improves GPU utilization and delivers the best possible application experience. OctoStack also provides best-in-class load balancing to operate at scale.

Easy to use APIs & customization

Application developers can use OctoStack’s industry-standard API’s, including Python and TypeScript SDK’s. Our ergonomic API’s allow rapid development and integration. You can easily load and run open source models, and fine-tuned checkpoints.

Configuration & Deployment

GPU Configuration

OctoStack can run a multi-GPU configuration to improve throughput and latency, which is easily enabled with a single configuration setting. We recommend using multiple GPU’s per replica when running models with higher precision and a greater number of parameters. The OctoAI team can help you identify the multi-GPU configuration that best matches your throughput & latency goals.


OctoStack is comptable with all cloud environments - these are key system requirements:

  • Kubernetes, or Docker & Docker Compose
  • Redis in cluster mode
  • NVIDIA drivers
  • GPUs
    • OctoStack supports a range of hardware including A10G, A100, and H100 GPUs
    • A100s such as AWS p4d.24xlarge instances can support a broad set of models


OctoStack supports Kubernetes deployment via manifest files, and a Docker Compose deployment compatible with any container orchestration service. There’s a few simple steps to deploy:

  1. OctoAI allow lists your AWS account, so you can access OctoStack containers in OctoAI’s AWS ECR.
  2. Pull the OctoStack containers from OctoAI’s AWS ECR.
  3. Configure and deploy using Kubernetes or Docker Compose, using OctoAI-provided guides.

Get Started

Reach out to the team to see a live demo and get started on your OctoStack deployment.