Many organizations now want full ownership of their AI infrastructure. The motivation for self-hosting ranges from data ownership requirements and contractual obligations to maintaining the highest level of system security.
This post covers the infrastructure decisions, model selection tradeoffs, and performance optimization techniques we encountered while building a self-hosted multi-model inference stack. Security architecture and model licensing are out of scope here as both deserve their own deep dives. Still, everything about building the infrastructure and making it perform is fair game.
Here’s what our stack looked like:
- An open-source inference engine (vLLM production stack)
- Multiple open-weight models served side by side
- Accelerated computing instances on AWS
- A scalable, highly available EKS cluster
vLLM as an inference engine
There are several open-source inference engines to choose from, including LMDeploy, SGLang, and TensorRT-LLM.
We chose vLLM for its performance, broad model support, extensive documentation, and built-in multi-model type routing.
The vLLM production stack ships with an infrastructure diagram you can extend for your own setup, but the core components are:
Request router
An OpenAI-compatible API layer. It uses prefix-aware routing to send requests that repeat context to the worker that already holds that context, reducing time to first token. In a multi-model setup, the router also resolves the model name in each request to the right set of workers.
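The core idea of prefix-aware routing can be sketched in a few lines: requests that share a prompt prefix deterministically land on the same worker, so its KV cache gets reused. This is a toy illustration, not the vLLM router's actual algorithm, and the worker names and prefix length are made up:

```python
import hashlib

class PrefixRouter:
    """Toy prefix-aware router: requests sharing a prompt prefix
    land on the same worker so its KV cache can be reused."""

    def __init__(self, workers, prefix_len=32):
        self.workers = workers
        self.prefix_len = prefix_len  # characters of the prompt used as routing key

    def route(self, prompt: str) -> str:
        key = prompt[: self.prefix_len].encode()
        digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
        return self.workers[digest % len(self.workers)]

router = PrefixRouter(["worker-0", "worker-1", "worker-2"])
shared = "System: You are a support assistant.\n\n"
# Two requests with the same system-prompt prefix hit the same worker:
a = router.route(shared + "User: How do I reset my password?")
b = router.route(shared + "User: Where is my invoice?")
assert a == b
```

A real router also weighs worker load and cache contents, but the routing-key idea is the part that cuts time to first token.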
Workers
vLLM instances running on GPU nodes. The stack handles tensor parallelism across multiple GPUs for large models out of the box.
Observability stack
Prometheus and Grafana for monitoring.
Simplified end-to-end flow
The request router analyzes the incoming prompt’s prefix and directs it to a worker that already holds that context in memory. The worker processes the request with optimized block-based memory management, pulling previously computed states from a per-node or cluster-wide cache, and generates the response.
Choosing the right hosting environment
The AI hosting landscape is competitive. The vLLM production stack has cloud deployment support for AWS, Azure, and GCP, and project velocity matters a lot at this stage.
This is why we chose AWS EKS: the savings from cheaper providers did not justify the added setup complexity.
Specialty cloud hosting providers are cheaper, but they often offer unmanaged environments. That means you handle all the heavy lifting yourself: networking, orchestration, GPU scheduling, the lot.
On-premise considerations
Buying hardware right away is an operational risk, even if your workloads are predictable.
We recommend a phased approach:
Phase 1: Model PoC
Optional if you already know the model you want. Use managed services like Amazon Bedrock to find the sweet spot between model size and reasoning capability. The open-weight model catalogue is expanding fast and the setup is minimal.
Phase 2: Cloud PoC
Use cloud-managed Kubernetes to prototype your multi-model infrastructure. Test different GPU offerings, benchmark your setup, and figure out your TPM and RPM requirements. Test your open-source model choices without locking into expensive hardware early.
Phase 3: On-premise refinement
Once you understand your patterns and limits, modify your existing Kubernetes cluster for an on-premise deployment. This is significantly easier than starting here from scratch.
Choosing the right model
To simplify the equation: the two factors that drive infrastructure cost are model size (parameter count) and model context (the active memory window containing your conversation and retrieved data).
The LLM is your main challenge. Embedding and reranking models require comparatively little GPU power.
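The way size and context translate into GPU memory can be estimated with back-of-envelope arithmetic: weights plus KV cache. The sketch below assumes FP16 weights and KV cache and a Llama-3-8B-like shape (32 layers, 8 grouped-query KV heads of dimension 128); the numbers are rough, not a sizing guarantee:

```python
def vram_estimate_gb(params_b, n_layers, n_kv_heads, head_dim,
                     context_tokens, batch, weight_bytes=2, kv_bytes=2):
    """Rough GPU memory estimate: weights + KV cache.
    The KV cache stores one key and one value vector per layer per token."""
    weights = params_b * 1e9 * weight_bytes
    kv = 2 * n_layers * n_kv_heads * head_dim * context_tokens * batch * kv_bytes
    return (weights + kv) / 1e9  # decimal GB

# 8B model, 20k-token context, 8 concurrent requests:
print(round(vram_estimate_gb(8, 32, 8, 128, 20_000, 8), 1))  # ~37 GB
```

Note how the KV cache rivals the weights themselves at long contexts and high concurrency, which is why context length is a first-class cost driver.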
Here are three scenarios to illustrate the range. Note that these are rough on-demand estimates, and be sure to check current pricing and consider reserved or spot instances where applicable.
Small: chatbot with basic interactions
Customer support, simple Q&A. No complex reasoning or large context required.
- Size: 7B or 8B parameters
- Context: 2k–20k tokens
- OSS models: Llama 3 (8B), Mistral (7B), Qwen (7B)
- Proprietary use case equivalents: GPT-4o-mini, Claude Haiku, Gemini Flash-Lite
- Infrastructure: A single G6e family instance
- Monthly cost: ~$400–$600
Medium: reasoning over a knowledge base
Internal knowledge bases where the model reads retrieved company documents, follows strict instructions, and needs to minimize hallucinations.
- Size: 70B parameters
- Context: 20k–50k tokens
- OSS models: Llama 3 (70B), Mixtral (8x7B), Qwen (72B), GPT-OSS-20B
- Proprietary use case equivalents: Claude Sonnet, Gemini Flash
- Infrastructure: Multi-GPU setup
- Monthly cost: ~$3k–$8k
Large: high accuracy, high reasoning, high context
Complex code refactoring, massive document analysis, predictions, and advanced agents. Maximum accuracy and minimal hallucinations are non-negotiable.
- Size: 100B+ parameters
- Context: 50k+ tokens
- OSS models: GPT-OSS-120B, DeepSeek-R1, Mistral Large 3
- Proprietary use case equivalents: GPT-5, Claude Opus, Gemini Pro
- Infrastructure: p5e.48xlarge instances (8×H200)
- Monthly cost: ~$30k+
These are rough single-environment estimates. Multi-environment, highly available enterprise setups multiply these figures quickly.
Benchmarking your setup
Although there are fast general benchmarking tools available, like LLMfit, you should measure model performance in your own environment. This also reveals hardware traps that generic benchmarks won’t surface.
For example, adding more L40S GPUs may not increase performance. These GPUs communicate over the PCIe bus instead of NVLink, and the communication overhead can cancel out the compute gains.
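A crude scaling model makes the trap concrete: each added GPU shrinks the per-GPU compute share, but communication overhead grows with the number of participants. The overhead fractions below are made-up illustrations, not measured L40S or NVLink figures:

```python
def effective_speedup(n_gpus, comm_fraction):
    """Toy scaling model: per-GPU compute share shrinks with more GPUs,
    but a per-peer communication overhead grows. Illustrative only."""
    compute = 1.0 / n_gpus               # each GPU's share of the work
    comm = comm_fraction * (n_gpus - 1)  # overhead grows with participants
    return 1.0 / (compute + comm)

# Fast interconnect (low per-peer overhead) vs slow PCIe-style overhead:
print(round(effective_speedup(4, 0.01), 2))  # healthy scaling
print(round(effective_speedup(4, 0.15), 2))  # overhead eats the gains
```

With the high-overhead value, four GPUs deliver well under 2x, which is the pattern we saw on PCIe-attached cards.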
vLLM ships with a native benchmarking tool, the vllm bench serve command. The key metrics to watch:
| Metric | Meaning |
|---|---|
| Median TTFT (Time to First Token) | How long from prompt submission to the first generated token. The user’s perceived responsiveness. |
| Median TPOT (Time Per Output Token) | How long each subsequent token takes to generate. |
| Median ITL (Inter-Token Latency) | The gap between consecutive tokens. Smoothness of streaming output. |
| Output token throughput | Tokens generated per second across all concurrent users. |
| Total token throughput | Combined rate for both prompt processing and generation. |
| Request throughput | Complete requests resolved per second. |
| Max request concurrency | Peak number of simultaneous requests handled during the test. |
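TTFT, TPOT, and ITL are all derived from per-token arrival timestamps. A minimal sketch of the arithmetic for a single request (not the benchmark tool's implementation):

```python
from statistics import median

def request_metrics(submit_t, token_times):
    """Derive per-request latency metrics from token arrival timestamps."""
    ttft = token_times[0] - submit_t                 # time to first token
    itl = [b - a for a, b in zip(token_times, token_times[1:])]
    # TPOT: average time per output token after the first one
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot, itl

# One request: submitted at t=0.0, tokens arrive at these times (seconds)
ttft, tpot, itl = request_metrics(0.0, [0.30, 0.35, 0.41, 0.46, 0.52])
print(ttft)                  # 0.3
print(round(tpot, 3))        # 0.055
print(round(median(itl), 3)) # 0.055
```

The benchmark reports medians across many concurrent requests, which is what you want: tail behavior under load, not a single idle-cluster run.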
Optimization techniques
There is extensive documentation on optimization techniques. Here’s a summary of those that made the biggest difference for us.
Quantization
Reduces weight precision (e.g., from 16-bit to 8-bit or 4-bit) to shrink the model’s memory footprint. This has a direct impact on what model you can fit on your available hardware.
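The memory math is simple and worth internalizing; quality impact is not modeled here, only footprint:

```python
def model_weight_gb(params_b, bits):
    """Weight memory for a dense model at a given precision."""
    return params_b * 1e9 * bits / 8 / 1e9

for bits in (16, 8, 4):
    print(f"70B @ {bits}-bit: {model_weight_gb(70, bits):.0f} GB")
# 140 GB at 16-bit, 70 GB at 8-bit, 35 GB at 4-bit
```

An 8-bit 70B model fits where a 16-bit one never would, which often moves you down an entire instance class.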
Automatic prefix caching
Worker- or node-level memory management that caches the KV state of previous queries. If you’re querying the same long document multiple times, the document is processed once and subsequent queries pull from cache. The result is higher throughput and lower latency.
Distributed caching via LMCache
Automatic prefix caching is limited to a single worker’s GPU VRAM — extremely fast, but expensive. LMCache enables cluster-wide offloading to cheaper storage (CPU memory, disk, or Redis) at the cost of some latency. Use both in a tiered memory hierarchy for the best balance.
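The tiered hierarchy behaves like a two-level cache with eviction to the slower tier and promotion back on reuse. This toy sketch mirrors the idea, not either project's real implementation; the capacity and keys are invented:

```python
class TieredKVCache:
    """Toy two-tier cache: a small 'GPU' tier backed by a larger 'CPU' tier."""

    def __init__(self, gpu_capacity):
        self.gpu = {}   # fast tier, limited slots
        self.cpu = {}   # slow tier, effectively unbounded here
        self.gpu_capacity = gpu_capacity

    def put(self, prefix_hash, kv_state):
        if len(self.gpu) >= self.gpu_capacity:
            # Demote the oldest GPU entry to the CPU tier instead of recomputing it
            old_key, old_val = next(iter(self.gpu.items()))
            del self.gpu[old_key]
            self.cpu[old_key] = old_val
        self.gpu[prefix_hash] = kv_state

    def get(self, prefix_hash):
        if prefix_hash in self.gpu:
            return self.gpu[prefix_hash], "gpu-hit"
        if prefix_hash in self.cpu:
            # Promote back to the fast tier on reuse
            self.put(prefix_hash, self.cpu.pop(prefix_hash))
            return self.gpu[prefix_hash], "cpu-hit"
        return None, "miss"

cache = TieredKVCache(gpu_capacity=2)
cache.put("doc-a", "kv-a")
cache.put("doc-b", "kv-b")
cache.put("doc-c", "kv-c")   # demotes doc-a to the CPU tier
print(cache.get("doc-c")[1]) # gpu-hit
print(cache.get("doc-a")[1]) # cpu-hit, promoted back
```

A CPU-tier hit still beats a miss by a wide margin, because a miss means re-running prefill over the entire document.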
Tensor parallelism
Workload distribution that splits a model’s tensors across multiple GPUs. Effectively a requirement for larger models. Performance depends heavily on fast interconnects like NVLink.
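The splitting itself is plain linear algebra: shard one matrix-vector product across workers, compute partials independently, then gather. In a real deployment each shard lives on its own GPU and the gather is a collective op over NVLink or PCIe, which is exactly the communication cost discussed earlier. A toy sharded matvec:

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sharded_matvec(W, x, n_shards):
    """Toy tensor parallelism: split the output dimension across shards,
    run each shard's matvec independently, then gather the results."""
    rows_per = len(W) // n_shards
    shards = [W[i * rows_per:(i + 1) * rows_per] for i in range(n_shards)]
    partials = [matvec(shard, x) for shard in shards]  # parallel on real hardware
    return [y for part in partials for y in part]      # the gather step

W = [[1, 0], [0, 1], [2, 2], [3, -1]]
x = [4, 5]
assert sharded_matvec(W, x, 2) == matvec(W, x)
```

Each shard also only stores its slice of the weights, which is why tensor parallelism lets a model exceed any single GPU's VRAM.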
Speculative decoding
There are multiple methods. One approach pairs a large model with a tiny, fast draft model: the draft guesses the next few tokens, and the large model verifies them in a single pass. Accepted guesses can multiply token generation speed.
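The draft-and-verify loop can be sketched with toy deterministic "models" (plain next-token functions standing in for real LLMs). This greedy-acceptance version is illustrative only, not vLLM's actual speculative decoding:

```python
def speculative_decode(target, draft, prompt, steps, k=4):
    """Toy greedy speculative decoding: the draft proposes k tokens, the
    target keeps the longest agreeing run, then contributes one token."""
    out = list(prompt)
    target_calls = 0
    while len(out) - len(prompt) < steps:
        proposal, ctx = [], list(out)
        for _ in range(k):                  # cheap draft pass, k guesses
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        target_calls += 1                   # conceptually one batched target pass
        for t in proposal:
            if target(out) == t:
                out.append(t)               # target agrees: accept draft token
            else:
                out.append(target(out))     # disagree: take the target's token
                break
        else:
            out.append(target(out))         # all accepted: one bonus token
    return out[len(prompt):], target_calls

def next_ab(ctx):  # deterministic toy model: alternate 'a' and 'b'
    return "b" if ctx and ctx[-1] == "a" else "a"

tokens, calls = speculative_decode(next_ab, next_ab, ["a"], steps=10, k=4)
print("".join(tokens), calls)  # 10 tokens for only 2 target passes
```

Ten tokens for two target passes instead of ten is the whole trick; the speedup collapses when the draft model guesses poorly, so draft quality matters.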
Disaggregated prefill and decode
Separates the prefill and decode stages onto different GPUs or nodes. Since the two tasks have different computational profiles (compute-bound vs. memory-bandwidth-bound), you can scale each independently — either to improve responsiveness or to prevent long prompts from stalling active generation.
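The shape of the pipeline is a two-stage handoff: a prefill pool turns the prompt into a KV cache, then a separately scaled decode pool streams tokens from it. A minimal sketch under invented names (kv_handle, tok):

```python
def prefill(request):
    """Compute-bound stage: process the full prompt once and
    produce a KV-cache handle for the decode stage."""
    request["kv_handle"] = f"kv:{request['prompt_tokens']}"
    return request

def decode(request, n_tokens):
    """Memory-bandwidth-bound stage: generate tokens one at a time
    from the KV cache produced by prefill."""
    assert "kv_handle" in request, "decode pool needs prefill output"
    return ["tok"] * n_tokens

# A 30k-token prompt occupies only a prefill node, so decode nodes
# keep streaming tokens for other users in the meantime.
req = prefill({"prompt_tokens": 30_000})
print(decode(req, 3))
```

The payoff is independent scaling: add prefill nodes for bursts of long prompts, add decode nodes for many concurrent streams.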
What’s ahead for self-hosting
There will always be demand for on-premise self-hosted AI in systems that require maximum control over their data.
The barrier to entry is dropping. Inference engines are maturing, optimization techniques are compounding, and models are getting better with fewer parameters and lower VRAM requirements. The recent Gemma 4 release is a good example: judging by the benchmarks, it delivers strong performance for a modest hardware investment. Stay tuned for a deep dive on that one.
In conclusion, enterprise-grade self-hosting remains expensive, but the trajectory is clear: organizations will be able to do significantly more with significantly less hardware.
The phased approach we outlined here is designed to let you start proving value now, without committing to infrastructure you don’t yet understand.