A walkthrough of our model selection framework
For GenAI app developers, a crucial step on the path from prototype to production is choosing the right LLM. Unfortunately, it isn't as simple as finding the trendiest model on HuggingFace, spinning up an inference API, and running a few tests. Model selection is just one step on the long and winding "path to production," and as with traditional software development, getting it right requires careful scoping and preparation. As the saying goes, "failing to plan is planning to fail."
But where do you start when there's no established playbook? GenAI developers are creating an entirely new class of applications, each with unique requirements, all while the core building blocks (LLMs) evolve at warp speed. The challenge is especially pronounced for anyone developing with open source models. Some opt to build with a large, generalized LLM to sidestep this predicament altogether, but these mega models come with a high price tag and limited privacy and configurability, making them a poor fit for many apps.
In this post, the OctoAI team shares what we've learned from working alongside customers with diverse use cases and requirements to identify the right LLM for their applications. This framework can help you develop effective model selection criteria, scope your requirements, and build a shortlist of the most promising LLMs to evaluate for your app. No two model selection decision trees are identical, but any company can apply these filters to its unique use case.
Filter 1: Outcomes
Getting crystal clear on your vision will ease the glide path for the rest of the model selection process. That way, you can work backwards from the experience you want to deliver. How will the LLM make the experience compelling? What behavior is essential to deliver on that experience?
Perhaps you need the model to use a specific tone, allow users to take action on data, or speak in a variety of languages — whatever your need, aligning the model capabilities with your goals is a critical first step. This will not only help you make the first “cut” of models, but also inform how you set up your data pipelines and integrations.
| If you want... | You may need... |
|---|---|
| Novel end-user experiences (e.g., assistants, games, creativity) | LoRA tuning, multimodality |
| Business users to take action on data | Large context window, prompts to introduce data, RAG |
| Structured output for application workflows | JSON formatting, function calling (see the sketch below) |
| Specialized tasks (e.g., coding, creative writing, non-English languages) | Community checkpoints, codegen, world language models |
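For example, if structured output is on your list, you can sanity-check a candidate model's JSON discipline early. Here's a minimal sketch using the OpenAI Python SDK against an OpenAI-compatible endpoint; the endpoint URL and model name are placeholders, and `response_format` support varies by model and provider, so verify it for your candidates.

```python
# Minimal sketch: check whether a candidate model reliably emits strict JSON.
# The endpoint URL and model name are hypothetical placeholders, and
# response_format support varies by model/provider.
import json
from openai import OpenAI

client = OpenAI(
    base_url="https://example-inference-host/v1",  # hypothetical endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="candidate-model",  # a model from your shortlist
    messages=[
        {
            "role": "system",
            "content": 'Reply only with JSON matching '
                       '{"sentiment": "positive|negative|neutral", "summary": "<string>"}.',
        },
        {"role": "user", "content": "The new dashboard is fast, but setup was confusing."},
    ],
    response_format={"type": "json_object"},  # not supported everywhere
)

raw = response.choices[0].message.content
try:
    parsed = json.loads(raw)  # the real test: does the output parse?
    print(parsed["sentiment"], "-", parsed["summary"])
except (json.JSONDecodeError, KeyError):
    print("Model failed the structured-output check:", raw)
```

A model that passes this kind of spot check is worth carrying forward; one that wraps its JSON in prose or invents extra keys will cost you downstream parsing logic.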
Filter 2: Constraints
By this step you’ve gained a better understanding of your functional requirements and the techniques required to implement your LLM for the desired behavior. Next, you’ll use business and technical constraints to further refine the list and scope some of your architectural choices, including deployment targets. Constraints to consider include:
Cost: Determine your budget and the cost constraints for your project.
Fault Tolerance: Assess the fault tolerance required for your application.
Privacy: Ensure that the model can handle your data privacy requirements.
Inputs: Identify the kinds of data you expect the LLM to act on.
SLAs (Service Level Agreements): Define the latency and scale requirements.
In a practical sense, here’s how your constraints might map to architectural priorities:
| If you have this constraint... | Prioritize these... |
|---|---|
| Model must call proprietary data | Data control, observability, privacy |
| 90 tokens/sec LLM response time | High throughput / low latency |
| Maintain <$0.50 per 1 million output tokens | Smaller model size, cost/latency tradeoffs (see the arithmetic below) |
| Serve 1 billion tokens daily with 99% feature reliability (LLM uptime) | Enterprise-scale infrastructure |
| 95% task accuracy | Model quality, ease of customization, longer context windows |
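It helps to run the arithmetic on constraints like these before committing to a model. The sketch below works through the example numbers from the table above; all figures are illustrative assumptions, not real pricing.

```python
# Back-of-the-envelope check of the example constraints above.
# All numbers are illustrative assumptions, not real pricing.

daily_output_tokens = 1_000_000_000        # 1B tokens served per day
price_per_million_output = 0.45            # candidate model's $/1M output tokens (assumed)
budget_per_million_output = 0.50           # budget constraint from the table

daily_cost = daily_output_tokens / 1_000_000 * price_per_million_output
print(f"Daily output-token cost: ${daily_cost:,.2f}")
print("Within budget:", price_per_million_output <= budget_per_million_output)

# Latency: how long a 300-token response takes at the 90 tokens/sec target
response_tokens = 300
target_tokens_per_sec = 90
print(f"Response time at target: {response_tokens / target_tokens_per_sec:.1f}s")

# Sustained throughput the fleet must average to serve the daily volume
seconds_per_day = 86_400
print(f"Average tokens/sec needed: {daily_output_tokens / seconds_per_day:,.0f}")
```

At these assumed numbers, 1 billion daily tokens costs $450/day and requires roughly 11,600 tokens/sec of sustained fleet throughput, which is the kind of figure that quickly separates "run it on one GPU" from "enterprise-scale infrastructure."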
Filter 3: Human feedback (AKA the vibe check)
Model evaluation is a complex topic that warrants its own dedicated post (or rather, a series of them). To conduct an effective model evaluation, you'll need to curate high-quality data that faithfully represents the inputs and outputs of each task the LLM must perform, develop strong evaluation criteria for each sub-task, and build mechanisms for human-led and LLM-automated A/B testing, with metrics to track it all. But before you go there, you must first check... the vibes.
In this deeply unscientific (yet highly effective) process, you'll run a few of your shortlisted models through your prompts and data pipeline to approximate your end-user's experience. You'll get a sense of whether you're within striking distance of your desired outcomes and spend time tweaking your system prompts until things feel close.
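A lightweight harness makes the vibe check repeatable. The sketch below loops a handful of representative prompts through each shortlisted model via an OpenAI-compatible endpoint and prints the outputs side by side for eyeballing; the endpoint URL, model names, and prompts are all hypothetical placeholders to swap for your own.

```python
# Minimal "vibe check" harness: run the same prompts through each
# shortlisted model and eyeball the results side by side.
# Endpoint URL and model names are hypothetical placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://example-inference-host/v1",
    api_key="YOUR_API_KEY",
)

SYSTEM_PROMPT = "You are a concise, friendly support assistant."  # tweak as you iterate
SHORTLIST = ["model-a", "model-b", "model-c"]  # your candidate models
PROMPTS = [
    "Summarize this ticket in one sentence: 'App crashes when I upload a CSV.'",
    "Draft a polite reply asking the user for their app version.",
]

for prompt in PROMPTS:
    print(f"\n=== Prompt: {prompt}")
    for model in SHORTLIST:
        reply = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": prompt},
            ],
            temperature=0.7,
        )
        print(f"\n--- {model}:\n{reply.choices[0].message.content}")
```

Keeping the prompts and system prompt in one place also means that every tweak you make during the vibe check is captured and can seed your formal evaluation set later.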
From there, you can run more rigorous unit tests of your shortlisted models against your functional requirements and application constraints with higher confidence.
Conclusion
Selecting the right LLM is a critical step towards achieving your application goals. By understanding your constraints, aligning with desired outcomes, and utilizing a robust evaluation process, you can make an informed decision. OctoAI is here to assist you every step of the way, providing the tools and expertise needed to navigate the complex landscape of model selection.
For more information on getting started with OctoAI, visit our SaaS Signup page, request a Proof of Concept for OctoStack, or join our Discord Community.