
OctoAI’s Inference Engine: enterprise-grade, dynamically reconfigurable, natively multimodal

Aug 27, 2024 · 6 minutes
OctoAI’s next-generation large multimodal model (LMM) inference engine is built on open technology and optimized for use in the enterprise.

The release of Llama 3.1 405B marks an important milestone: open-source models are as good as proprietary models without qualification. As Mark Zuckerberg shared on Meta’s blog, open models will be the “Linux kernel” of AI. This aligns with our long-held point of view on the open-source software (OSS) AI ecosystem and mirrors the evolution of previous OSS waves.

Open development of critical infrastructure is an important part of building a high-value commercial offering, but not sufficient alone for enterprise needs. Commercial Linux “distributions” like Ubuntu and Red Hat developed and packaged compilers, desktop environments, package managers, and numerous other tools that built on the core kernel and made it a product.

We want to do the same for AI.

At OctoAI we are focused on building on a strong open-source foundation to deliver AI “distributions” to users. Truly valuable GenAI applications require cross-stack, interoperable services to power inference, fine-tuning, and evaluation capabilities. We are working towards this vision in open source, our commercial products, and partnerships.

OctoAI’s multimodal inference engine delivers value to customers that goes beyond the model endpoint, aligning closely to the essential needs of enterprise customers:

  • Inference is the cornerstone of the GenAI flywheel, but it must be designed to support broader organizational needs, including fine-tuning, evaluation, routing, and more

  • Balancing unit economics with system requirements for latency, throughput, and quality

  • Customization is key to solving many problems by adapting a model or set of models

  • Transparency and control driven by open source innovations

  • Flexible and adaptable support for a full range of enterprise tools, data types, and use cases without burdensome engineering overhead

Inference is the cornerstone of the AI flywheel

  • Inference is central to the “model lifecycle.” Customers may build their first product experience around inference with a single model, but inference continues to be part of every production deployment and model evaluation from product inception to full maturity.

  • Our inference engine is designed to evolve with customers along the journey from one off-the-shelf model to a herd of customized models. It enables this by accepting new models such as Llama 3.1 405B (which we supported in less than 48 hours of wall-clock time), accepting new parameter-efficient fine-tunes (PEFTs) and new PEFT formats, recording user traffic for offline dataset construction for evaluation and fine-tuning (a minimal sketch of this pattern follows this list), supporting content filtering, and more.

  • Finally, we view the operational data produced by inference, along with associated user feedback and other telemetry, as an emerging asset for businesses.
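
As a rough illustration of the traffic-recording pattern mentioned above, the sketch below appends prompt/response pairs and optional feedback to a JSONL file that can later be filtered into a fine-tuning or evaluation set. The field names and file layout are illustrative assumptions, not the format our platform records.

```python
import json
import time
from pathlib import Path

LOG_PATH = Path("inference_traffic.jsonl")  # hypothetical location

def log_interaction(prompt: str, response: str, model: str, feedback: float | None = None) -> None:
    """Append one prompt/response pair (plus optional user feedback) as a JSONL record."""
    record = {
        "timestamp": time.time(),
        "model": model,
        "prompt": prompt,
        "response": response,
        "feedback": feedback,  # e.g. thumbs up/down mapped to 1.0 / 0.0
    }
    with LOG_PATH.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def build_finetune_set(min_feedback: float = 1.0) -> list[dict]:
    """Filter logged traffic into a candidate fine-tuning or evaluation set."""
    with LOG_PATH.open(encoding="utf-8") as f:
        records = [json.loads(line) for line in f]
    return [r for r in records if (r["feedback"] or 0.0) >= min_feedback]
```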

Unit economics

  • We have designed a solution that can efficiently run dozens of PEFTs on a single node (see the multi-adapter batching sketch after this list).

  • We enable running in environments beyond the latest NVIDIA GPUs, such as older NVIDIA GPU families (A100, A10G, etc.) and non-NVIDIA GPUs.

  • Throughput and multi-tenancy are key pillars of what we do: peak inference performance at an acceptable price.
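
To make the multi-PEFT economics concrete, here is a toy sketch of why co-locating many adapters on one node pays off: requests from different tenants are grouped by adapter so a single set of base weights serves them all. This is a simplified illustration, not our actual batching implementation, and the adapter names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Request:
    prompt: str
    adapter: str | None  # name of a LoRA/PEFT, or None for the base model

def group_batch(requests: list[Request]) -> dict[str | None, list[Request]]:
    """Group a multi-tenant batch by adapter.

    The expensive base weights are shared across the whole batch; only the small
    low-rank deltas differ per request, so co-locating many adapters on one node
    keeps the GPU busy without replicating the base model.
    """
    groups: dict[str | None, list[Request]] = defaultdict(list)
    for req in requests:
        groups[req.adapter].append(req)
    return dict(groups)

batch = [
    Request("summarize ...", adapter="support-bot-v3"),   # hypothetical adapter names
    Request("classify ...", adapter="ticket-router-v1"),
    Request("translate ...", adapter=None),
]
print({adapter: len(reqs) for adapter, reqs in group_batch(batch).items()})
```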

Customization is key

  • Companies shouldn’t have to build or maintain their own inference stack to retain control of their inference environment.

  • We give users control over which models they run, which PEFTs they run, and which optimizations they turn on or off (a minimal request sketch follows this list).

  • Users can make conscious tradeoffs, such as balancing latency vs. throughput, depending on their application, user SLAs, and budget.
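
As a minimal illustration of this kind of per-request control, the sketch below selects a base model plus a PEFT and bounds output length against an OpenAI-compatible chat completions endpoint. The base URL, model name, and base-model:PEFT naming convention are illustrative assumptions, not exact product identifiers.

```python
# A minimal sketch of per-request customization against an OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="https://text.octoai.run/v1",   # hypothetical endpoint URL
    api_key="OCTOAI_API_TOKEN",              # replace with a real token
)

response = client.chat.completions.create(
    model="meta-llama-3.1-8b-instruct:my-support-lora",  # hypothetical base:PEFT pair
    messages=[{"role": "user", "content": "Draft a reply to this support ticket ..."}],
    max_tokens=256,     # bounding output length is one lever in the latency/cost tradeoff
    temperature=0.2,
)
print(response.choices[0].message.content)
```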

Transparency and Control through OSS Innovation

  • As long-time OSS trailblazers and creators of XGBoost, Apache TVM, and MLC-LLM, we wholeheartedly agree with Meta’s recent statement that open source AI is a net positive for the industry, delivering greater visibility and control over the model itself, the data it utilizes, and the environment it runs in.

  • MLC-LLM technology is a core part of OctoAI’s inference engine: it offers best-in-class open performance, runs on a variety of devices (including mobile phones, servers, and web browsers), and is open for all to extend and build on.

  • A global community of innovators, including OctoAI, Carnegie Mellon University, and the University of Washington, continuously adds value with new optimizations, which are then passed along to customers.

Flexible architecture to support enterprise needs

We believe enterprises shouldn’t have to retrofit their existing workflows to accommodate generative AI. Customers of OctoAI SaaS and OctoStack utilize traditional Large Language Models (LLMs), video, image, and voice separately and in multimodal configurations that serve their unique needs. Our flexible architecture supports current and future modalities, tools, datasets, and privacy requirements. To achieve this, the OctoAI inference engine is designed around a few core principles:

  • Optimizations must compose, meaning any new performance or user experience improvement should be applied across modalities and model formats (e.g. PEFTs).

  • Dynamic reconfiguration is essential; we must be able to handle multi-tenant/multi-user workloads and support runtime selection of PEFTs and other customization options.

  • Traditional Large Language Models (LLMs) and Large Multimodal Models (LMMs) must co-exist in the same serving framework (see the multimodal request sketch after this list).

  • Support for large context sizes. This will be incredibly important in the future, especially for multimodality, where images, audio, and video can greatly grow the input context of a prompt. We believe large context sizes will give companies the ability to build first and optimize later: feed more relevant data to the model up-front, then optimize as they reach scale.
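
To illustrate LLM/LMM co-existence behind one API, here is a sketch of a request that interleaves text and an image in a single prompt using the common OpenAI-style content-parts message shape. The endpoint and model name are illustrative assumptions.

```python
# A sketch of an LMM request that mixes text and image content in one prompt.
from openai import OpenAI

client = OpenAI(
    base_url="https://text.octoai.run/v1",  # hypothetical endpoint URL
    api_key="OCTOAI_API_TOKEN",
)

response = client.chat.completions.create(
    model="llava-1.5-7b",  # hypothetical multimodal model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What product defect is visible in this photo?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/defect.jpg"}},
        ],
    }],
    max_tokens=128,
)
print(response.choices[0].message.content)
```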

OctoAI inference engine

OctoAI Inference Engine diagram

The service layer handles translating raw user input, whether image, audio, text data, or URLs, into a standard format, as well as any relevant caching.

The multimodal cache ingests inputs of all formats and keeps a cache for multi-inference on those assets.
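
A minimal sketch of the idea, assuming a simple content-addressed store: an asset is keyed by a digest of its bytes, so an image or audio clip referenced across many requests is stored and pre-processed once. The class and method names are our own illustration.

```python
import hashlib

class AssetCache:
    """Toy content-addressed cache for multimodal assets."""

    def __init__(self) -> None:
        self._store: dict[str, bytes] = {}

    def put(self, raw: bytes) -> str:
        """Key the asset by a digest of its bytes and store it only once."""
        key = hashlib.sha256(raw).hexdigest()
        self._store.setdefault(key, raw)
        return key

    def get(self, key: str) -> bytes | None:
        return self._store.get(key)

cache = AssetCache()
key = cache.put(b"...image bytes...")
assert cache.get(key) == b"...image bytes..."
```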

The Engine I/O layer handles translating the logical inputs into the correct formats, applying the stack of pre-processing layers required to map them into token space, and then combining the tokenized inputs before feeding them to the core engine.

The core engine is responsible for going from an input context of tokens to a stream of output tokens. Inside, the continuous batching engine handles grouping requests and managing the KV cache. The scheduler handles scheduling of chunked prefill, decode, and the relevant model and assets (base model, required LoRAs, draft models for speculative decoding, and soon draft LoRAs as well).
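
The toy loop below illustrates how chunked prefill and decode can interleave under continuous batching: each step spends a bounded prefill budget so in-flight decode requests keep making progress. It is a simplification of the idea, not our scheduler.

```python
from dataclasses import dataclass, field

CHUNK = 256  # max prompt tokens prefilled per step (illustrative budget)

@dataclass
class Seq:
    prompt_tokens: int                     # prompt tokens still waiting for prefill
    generated: list[int] = field(default_factory=list)

def step(batch: list[Seq]) -> None:
    """One scheduler step: chunked prefill within a budget, decode for everyone else."""
    budget = CHUNK
    for seq in batch:
        if seq.prompt_tokens > 0 and budget > 0:
            take = min(seq.prompt_tokens, budget)   # chunked prefill
            seq.prompt_tokens -= take
            budget -= take
        elif seq.prompt_tokens == 0:
            seq.generated.append(0)                 # one decode token per step (placeholder id)

batch = [Seq(prompt_tokens=900), Seq(prompt_tokens=0, generated=[1, 2])]
for _ in range(5):
    step(batch)
print([(s.prompt_tokens, len(s.generated)) for s in batch])
```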

Finally, the output token stream is fed back to the Engine I/O layer, which transforms it into the relevant outputs: today raw text, structured output, and tool uses (custom or built-in), and soon multimodal outputs as well.

Architectural highlights

  • Dynamic behavior is handled by a combination of our continuous batching engine, optimizations for dynamic split-fuse (or chunked prefill), and dynamic kernel selection based on user workload characteristics.

  • Multimodality requires handling rich user input (audio, text, video, …) interspersed in the prompt. Text is easy to transmit, copy, and format for the model using simple templates. We have designed integration into our asset manager and engine I/O layer, which handles the caching, pre-processing, and transformation of multimodal inputs for the model, all of which impact performance. Similar to chat template handling for text, these are customizable, allowing us to support a wide family of models in a single API and with the same set of optimizations.

  • Generalized handling of tool use allows not only user-provided function calling but also the ability for models, such as the Llama 3.1 family, to invoke our other APIs as well (such as Image Generation).

  • The optimizations in our batching engine enable us to handle very large context sizes on a single node (single-node 405B, for example) while retaining performance and LoRA support.

  • Our quantization techniques are compilation-based, enabling low-overhead “static” quantization, which does not incur unnecessary runtime quantize or dequantize operations and ensures optimal memory use.

  • Early adoption of constrained sampling for structured output use cases, a feature recently released by OpenAI.

  • The release of this engine in our SaaS platform and OctoStack includes speculative decoding (built on work in the research community such as EAGLE) on our most popular models. We’re excited to share that speculative decoding will soon be available on user models and LoRAs. This is an interesting engineering challenge in the world of PEFTs and quantization, as the draft models must be calibrated to match the quantization and the specific user LoRAs to achieve acceptable performance (a toy sketch of the draft-and-verify loop follows this list).

  • Finally, we are doing all of this while aggressively optimizing performance and enabling reconfiguration.
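
For readers unfamiliar with speculative decoding, the toy below shows the basic draft-and-verify loop: a cheap draft model proposes a few tokens, the target model checks them, and the first mismatch is replaced with the target’s own token. Greedy acceptance and stand-in models are used for simplicity; this is an illustration of the technique, not our production implementation.

```python
import random

random.seed(0)
VOCAB = list(range(16))

def draft_propose(context: list[int], k: int) -> list[int]:
    """Stand-in for a small, fast draft model proposing k tokens."""
    return [random.choice(VOCAB) for _ in range(k)]

def target_next(context: list[int]) -> int:
    """Stand-in for the large target model's greedy next-token choice."""
    return sum(context) % len(VOCAB)

def speculative_step(context: list[int], k: int = 4) -> list[int]:
    """Accept draft tokens that match the target; correct at the first mismatch."""
    proposal = draft_propose(context, k)
    accepted: list[int] = []
    for tok in proposal:
        expected = target_next(context + accepted)
        if tok == expected:
            accepted.append(tok)        # draft token verified, keep it
        else:
            accepted.append(expected)   # replace with the target's own token and stop
            break
    return accepted

print(speculative_step([1, 2, 3]))
```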

While some of these individual features can be found in other proprietary or open-source solutions, OctoAI’s inference engine is the only one that brings them together.

We offer a no-cost proof of concept to demonstrate the superior value of OctoAI’s inference engine. You’ll work hand-in-hand with our Solutions Engineers to put OctoStack or our SaaS platform to the test to achieve success for your use case: latency, quality, cost savings, or your specific success criteria. Contact our team to get started.