Self-Hosting AI Models: A Practical Guide to Building Your Own Stack

[Image: GPU server rack in a dark blue data centre environment, representing self-hosted AI model inference infrastructure]

Many organizations now want full ownership of their AI infrastructure. The motivation for self-hosting ranges from data ownership requirements and contractual obligations to maintaining the highest level of system security.

This post covers the infrastructure decisions, model selection tradeoffs, and performance optimization techniques we encountered while building a self-hosted multi-model inference stack. Security architecture and model licensing are out of scope here as both deserve their own deep dives. Still, everything about building the infrastructure and making it perform is fair game.

Here’s what our stack looked like:

  • An open-source inference engine (vLLM production stack)
  • A multi-model setup built on open-weight models
  • Accelerated computing instances on AWS
  • A scalable, highly available EKS cluster

vLLM as an inference engine

There are several open-source inference engines to choose from, including LMDeploy, SGLang, and TensorRT-LLM.

We chose vLLM for its performance, broad model support, extensive documentation, and built-in multi-model type routing.

Their production stack ships with an infrastructure diagram you can extend for your own setup, but the core components are:

Request router

An OpenAI-compatible API layer. It uses prefix-aware routing to direct repeat context to the same worker, reducing time to first token. In a multi-model setup, the router handles requests by endpoint, model name, and worker assignment.

Workers

vLLM instances running on GPU nodes. The stack handles tensor parallelism across multiple GPUs for large models out of the box.

KV cache storage

In a multi-worker setup, previously computed state is retrieved from LMCache, which delivers significant performance gains, especially for models like GPT-OSS.

Observability stack

Prometheus and Grafana for monitoring.

Simplified end-to-end flow

The request router analyzes the incoming prompt’s prefix and directs it to a worker that already holds that context in memory. The worker processes the request with optimized block-based memory management, pulling previously computed states from a per-node or cluster-wide cache, and generates the response.
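As a toy illustration of the routing step (this is not vLLM's actual router code), prefix-aware routing can be as simple as hashing a fixed-length prompt prefix to pick a worker, so requests that share a system prompt or document land where that context is already cached. The worker names and prefix length below are invented:

```python
import hashlib

WORKERS = ["worker-0", "worker-1", "worker-2"]
PREFIX_LEN = 256  # assumed prefix window; tune to your typical shared context


def route(prompt: str) -> str:
    # Hash only the prefix so prompts sharing a preamble map to the same worker,
    # where the KV cache for that prefix is likely already warm.
    prefix = prompt[:PREFIX_LEN]
    digest = hashlib.sha256(prefix.encode()).digest()
    return WORKERS[int.from_bytes(digest[:8], "big") % len(WORKERS)]


# Two prompts sharing a long system preamble route to the same worker:
system = "You are a support assistant for ACME Corp. " * 10
same_worker = route(system + "Where is my order?") == route(system + "Reset my password.")
```

Real routers also weigh load and health, but the core idea is the same: deterministic placement by shared context.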

Choosing the right hosting environment

The AI hosting landscape is competitive. The vLLM production stack has cloud deployment support for AWS, Azure, and GCP, and project velocity matters a lot at this stage.

This is why we chose AWS EKS. The cost savings from alternative providers did not justify the increased setup complexity.

Specialty cloud hosting providers are cheaper, but they often offer unmanaged environments. That means you handle all the heavy lifting yourself: networking, orchestration, GPU scheduling, the lot.

On-premise considerations

Buying hardware immediately is an operational risk, even if you have predictable workloads.

We recommend a phased approach:

Phase 1: Model PoC

Optional if you already know the model you want. Use managed services like AWS Bedrock to find the sweet spot between model size and reasoning capability. The open-weight model catalogue is expanding fast and the setup is minimal.

Phase 2: Cloud PoC

Use cloud-managed Kubernetes to prototype your multi-model infrastructure. Test different GPU offerings, benchmark your setup, and figure out your TPM and RPM requirements. Test your open-source model choices without locking into expensive hardware early.

Phase 3: On-premise refinement

Once you understand your patterns and limits, modify your existing Kubernetes cluster for an on-premise deployment. This is significantly easier than starting here from scratch.

Choosing the right model

To simplify the equation: the two factors that drive infrastructure cost are model size (parameter count) and model context (the active memory window containing your conversation and retrieved data).
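A back-of-envelope sketch of why these two factors dominate: weight memory scales with parameter count, and KV-cache memory scales with context length. The model shape below is a rough Llama-3-8B-like assumption (32 layers, 8 KV heads, head dimension 128), not exact figures for any specific checkpoint:

```python
def weights_gib(params_b: float, bytes_per_param: int = 2) -> float:
    # FP16/BF16 weights: parameter count x 2 bytes each
    return params_b * 1e9 * bytes_per_param / 2**30


def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_val: int = 2) -> float:
    # Two tensors (K and V) per layer, per token
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_val / 2**30


# An 8B model in FP16: ~15 GiB of weights before any runtime overhead.
w = weights_gib(8)
# A single 20k-token context on the assumed shape: a few extra GiB of KV cache.
kv = kv_cache_gib(32, 8, 128, 20_000)
```

Multiply the KV-cache figure by concurrent requests and the cost driver becomes obvious: context, not just parameters, fills your VRAM.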

The LLM is your main challenge. Embedding and reranking models require comparatively little GPU power.

Here are three scenarios to illustrate the range. Note that these are rough on-demand estimates, and be sure to check current pricing and consider reserved or spot instances where applicable.

Small: chatbot with basic interactions

Customer support, simple Q&A. No complex reasoning or large context required.

  • Size: 7B or 8B parameters
  • Context: 2k–20k tokens
  • OSS models: Llama 3 (8B), Mistral (7B), Qwen (7B)
  • Proprietary use case equivalents: GPT-4o-mini, Claude Haiku, Gemini Flash-Lite
  • Infrastructure: A single G6e family instance
  • Monthly cost: ~$400–$600

Medium: reasoning over a knowledge base

Internal knowledge bases where the model reads retrieved company documents, follows strict instructions, and needs to minimize hallucinations.

  • Size: 70B parameters
  • Context: 20k–50k tokens
  • OSS models: Llama 3 (70B), Mixtral (8x7B), Qwen (72B), GPT-OSS-20B
  • Proprietary use case equivalents: Claude Sonnet, Gemini Flash
  • Infrastructure: Multi-GPU setup
  • Monthly cost: ~$3k–$8k

Large: high accuracy, high reasoning, high context

Complex code refactoring, massive document analysis, predictions, and advanced agents. Maximum accuracy and minimal hallucinations are non-negotiable.

  • Size: 100B+ parameters
  • Context: 50k+ tokens
  • OSS models: GPT-OSS-120B, DeepSeek-R1, Mistral Large 3
  • Proprietary use case equivalents: GPT-5, Claude Opus, Gemini Pro
  • Infrastructure: p5e.48xlarge instances (8×H200)
  • Monthly cost: ~$30k+

These are rough single-environment estimates. Multi-environment, highly available enterprise setups multiply these figures quickly.
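For rough planning, the arithmetic is simple: an always-on on-demand instance runs about 730 hours a month, and every extra replica and environment multiplies the bill. The $0.75/hour rate below is purely hypothetical:

```python
HOURS_PER_MONTH = 730  # approximate always-on hours


def monthly_cost(hourly_rate: float, instances: int = 1, environments: int = 1) -> float:
    # On-demand, always-on estimate; reserved or spot pricing will be lower
    return hourly_rate * HOURS_PER_MONTH * instances * environments


# Hypothetical $0.75/hr single-GPU instance, one environment:
single = monthly_cost(0.75)
# The same instance as an HA pair across dev/staging/prod: 6x the single figure
enterprise = monthly_cost(0.75, instances=2, environments=3)
```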

Benchmarking your setup

Although there are fast general benchmarking tools available, like LLMfit, you should measure model performance in your own environment. This also reveals hardware traps that generic benchmarks won’t surface.

For example, adding more L40S GPUs may not increase performance. These GPUs communicate over the PCIe bus instead of NVLink, and the communication overhead can cancel out the compute gains.

vLLM has a native benchmarking option via the vllm bench serve command. The key metrics to watch:

  • Median TTFT (Time to First Token): how long from prompt submission to the first generated token; the user’s perceived responsiveness.
  • Median TPOT (Time Per Output Token): how long each subsequent token takes to generate.
  • Median ITL (Inter-Token Latency): the gap between consecutive tokens; the smoothness of streaming output.
  • Output token throughput: tokens generated per second across all concurrent users.
  • Total token throughput: combined rate for both prompt processing and generation.
  • Request throughput: complete requests resolved per second.
  • Max request concurrency: peak number of simultaneous requests handled during the test.
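To make the latency metrics concrete, here is a toy illustration of how TTFT, TPOT, and ITL are derived from token arrival timestamps (this is how the metrics are defined, not vLLM's implementation):

```python
from statistics import median


def latency_metrics(submit_t: float, token_times: list[float]) -> dict:
    # token_times: wall-clock timestamps at which each output token arrived
    ttft = token_times[0] - submit_t                      # wait for first token
    itls = [b - a for a, b in zip(token_times, token_times[1:])]
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return {"ttft": ttft, "median_itl": median(itls), "tpot": tpot}


# A request submitted at t=0.0, first token at 0.5s, then one token every 40ms:
times = [0.5 + 0.04 * i for i in range(20)]
m = latency_metrics(0.0, times)
```

A setup can have excellent throughput yet a poor median ITL, which users experience as jittery streaming; that is why the table tracks both.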

Optimization techniques

There is extensive documentation on optimization techniques. Here’s a summary of those that made the biggest difference for us.

Quantization

Reduces weight precision (e.g., from 16-bit to 8-bit or 4-bit) to shrink the model’s memory footprint. This has a direct impact on which models you can fit on your available hardware.
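The memory math is straightforward. A quick sketch of weight memory at different precisions (weights only; runtime overhead and KV cache come on top):

```python
def model_gib(params_b: float, bits: int) -> float:
    # Weight memory: parameter count x bits per weight, converted to GiB
    return params_b * 1e9 * bits / 8 / 2**30


# A 70B model: ~130 GiB at 16-bit, ~65 GiB at 8-bit, ~33 GiB at 4-bit.
sizes = {bits: model_gib(70, bits) for bits in (16, 8, 4)}
```

In other words, a 70B model does not fit on a single 80 GB GPU at 16-bit precision, but its 8-bit variant does, which is exactly the kind of decision this technique unlocks.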

Automatic prefix caching

Worker/node-level memory management that caches the KV state of previous queries. If you’re querying the same long document multiple times, the document is processed once and subsequent queries pull from cache. The result is higher throughput and lower latency.
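A toy sketch of the block-level idea (illustrative only; vLLM caches fixed-size token blocks of real KV tensors in GPU memory, and the block size here is an assumption):

```python
import hashlib

BLOCK = 16  # tokens per cache block (assumed)


class PrefixCache:
    def __init__(self):
        self.store = {}  # chained block hash -> computed state (stub values here)
        self.hits = 0
        self.misses = 0

    def process(self, tokens: list[int]) -> None:
        # Hash blocks cumulatively so a block's key encodes everything before it;
        # identical prefixes therefore produce identical key chains.
        h = hashlib.sha256()
        for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
            h.update(str(tokens[i:i + BLOCK]).encode())
            key = h.hexdigest()
            if key in self.store:
                self.hits += 1
            else:
                self.misses += 1
                self.store[key] = f"kv-state-{key[:8]}"  # stand-in for KV tensors


cache = PrefixCache()
doc = list(range(64))              # a "long document" of 64 tokens
cache.process(doc + [100, 101])    # first query: all document blocks computed
cache.process(doc + [200, 201])    # second query: document blocks hit the cache
```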

Distributed caching via LMCache

Automatic prefix caching is limited to a single worker’s GPU VRAM — extremely fast, but expensive. LMCache enables cluster-wide offloading to cheaper storage (CPU memory, disk, or Redis) at the cost of some latency. Use both in a tiered memory hierarchy for the best balance.
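The tiered hierarchy behaves like any multi-level cache. A minimal sketch of the lookup-and-promote pattern that LMCache enables (all names and the tiny capacity here are invented; real tiers hold KV tensors, not strings):

```python
gpu_tier: dict = {}  # small and fast (GPU VRAM)
cpu_tier: dict = {}  # large and slower (CPU memory, disk, or Redis)
GPU_CAPACITY = 2     # artificially tiny to show eviction


def promote(key, value):
    # Make room in the fast tier by demoting its oldest entry to the slow tier
    if len(gpu_tier) >= GPU_CAPACITY:
        evicted, val = next(iter(gpu_tier.items()))
        del gpu_tier[evicted]
        cpu_tier[evicted] = val
    gpu_tier[key] = value


def get(key):
    if key in gpu_tier:
        return gpu_tier[key], "gpu"      # fastest path
    if key in cpu_tier:
        promote(key, cpu_tier[key])      # pay some latency, warm the fast tier
        return cpu_tier[key], "cpu"
    return None, "miss"                  # full recompute needed


for k in ("a", "b", "c"):
    promote(k, f"kv-{k}")                # "a" gets demoted when "c" arrives
```

The payoff: a cache entry evicted from VRAM costs a slower fetch instead of a full recompute of the prefill.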

Tensor parallelism

Distributes the workload by splitting tensors across multiple GPUs. Effectively a requirement for larger models. Performance depends heavily on fast interconnects like NVLink.

Speculative decoding

There are multiple methods, but a common approach pairs a large model with a tiny, fast draft model. The draft model guesses the next tokens, and the large model verifies them in a single pass. This multiplies token generation speed.
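A toy version of the draft-and-verify loop (both "models" are stand-in functions; real speculative decoding verifies token probabilities, not exact strings):

```python
def draft_model(context: list[str], k: int) -> list[str]:
    # A cheap model that is right most of the time
    canned = ["the", "cat", "sat", "on", "a", "mat"]
    return canned[len(context):len(context) + k]


def target_model(context: list[str], proposal: list[str]) -> list[str]:
    # The expensive model checks all proposed tokens in one pass,
    # keeping the accepted run and correcting the first mistake.
    truth = ["the", "cat", "sat", "on", "the", "mat"]
    accepted = []
    for i, tok in enumerate(proposal):
        if truth[len(context) + i] == tok:
            accepted.append(tok)
        else:
            accepted.append(truth[len(context) + i])  # correct it and stop
            break
    return accepted


context: list[str] = []
calls = 0
while len(context) < 6:
    proposal = draft_model(context, k=3)   # cheap: guess up to 3 tokens ahead
    context += target_model(context, proposal)  # expensive: one verify pass
    calls += 1
```

Here the expensive model runs 3 times instead of 6, because each verify pass accepts multiple draft tokens; that ratio is where the speedup comes from.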

Prefill/decode disaggregation

Separates the prefill and decode stages onto different GPUs or nodes. Since the two tasks have different computational profiles (compute-bound vs. memory-bandwidth-bound), you can scale each independently — either to improve responsiveness or to prevent long prompts from stalling active generation.

What’s ahead for self-hosting

There will always be demand for on-premise self-hosted AI in systems that require maximum control over their data.

The barrier to entry is dropping. Inference engines are maturing, optimization techniques are compounding, and models are getting better with fewer parameters and lower VRAM requirements. The recent Gemma 4 release is a good example: judging by the benchmarks, it delivers strong performance for a modest hardware investment. Stay tuned for a deep dive on that one.

In conclusion, enterprise-grade self-hosting remains expensive, but the trajectory is clear: organizations will be able to do significantly more with significantly less hardware.

The phased approach we outlined here is designed to let you start proving value now, without committing to infrastructure you don’t yet understand.