Self-Hosting AI Models: A Practical Guide to Building Your Own Stack

[Image: GPU server rack in a dark blue data centre environment, representing self-hosted AI model inference infrastructure]

Many organizations now want full ownership of their AI infrastructure. The motivation for self-hosting ranges from data ownership requirements and contractual obligations to maintaining the highest level of system security.

This post covers the infrastructure decisions, model selection tradeoffs, and performance optimization techniques we encountered while building a self-hosted multi-model inference stack. Security architecture and model licensing are out of scope here as both deserve their own deep dives. Still, everything about building the infrastructure and making it perform is fair game.

Here’s what our stack looked like:

  • An open-source inference engine (vLLM production stack)
  • A multi-model setup built on open-weight models
  • Accelerated computing instances on AWS
  • A scalable, highly available EKS cluster

vLLM as an inference engine

There are several open-source inference engines to choose from, including LMDeploy, SGLang, and TensorRT-LLM.

We chose vLLM for its performance, broad model support, extensive documentation, and built-in multi-model type routing.

Their production stack ships with an infrastructure diagram you can extend for your own setup, but the core components are:

Request router

An OpenAI-compatible API layer. It uses prefix-aware routing to direct repeat context to the same worker, reducing time to first token. In a multi-model setup, the router handles requests by endpoint, model name, and worker assignment.

Workers

vLLM instances running on GPU nodes. The stack handles tensor parallelism across multiple GPUs for large models out of the box.

KV cache storage

In a multi-worker setup, previously computed state is retrieved from LMCache, which delivers significant performance gains, especially for models like GPT-OSS.

Observability stack

Prometheus and Grafana for monitoring.

Simplified end-to-end flow

The request router analyzes the incoming prompt’s prefix and directs it to a worker that already holds that context in memory. The worker processes the request with optimized block-based memory management, pulling previously computed states from a per-node or cluster-wide cache, and generates the response.
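As a toy illustration of the routing step (this is not vLLM's actual router code), prefix-aware routing can be as simple as hashing a fixed-length prompt prefix to pick a worker, so requests that share a system prompt or document land where that context is already cached. The worker names and prefix length below are invented:

```python
import hashlib

WORKERS = ["worker-0", "worker-1", "worker-2"]
PREFIX_LEN = 256  # assumed prefix window; tune to your typical shared context


def route(prompt: str) -> str:
    # Hash only the prefix so prompts sharing a preamble map to the same worker,
    # where the KV cache for that prefix is likely already warm.
    prefix = prompt[:PREFIX_LEN]
    digest = hashlib.sha256(prefix.encode()).digest()
    return WORKERS[int.from_bytes(digest[:8], "big") % len(WORKERS)]


# Two prompts sharing a long system preamble route to the same worker:
system = "You are a support assistant for ACME Corp. " * 10
same_worker = route(system + "Where is my order?") == route(system + "Reset my password.")
```

Real routers also weigh load and health, but the core idea is the same: deterministic placement by shared context.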

Choosing the right hosting environment

The AI hosting landscape is competitive. The vLLM production stack has cloud deployment support for AWS, Azure, and GCP, and project velocity matters a lot at this stage.

This is why we chose AWS EKS. The cost savings from alternative providers did not justify the increased setup complexity.

Specialty cloud hosting providers are cheaper, but they often offer unmanaged environments. That means you handle all the heavy lifting yourself: networking, orchestration, GPU scheduling, the lot.

On-premise considerations

Buying hardware immediately is an operational risk, even if you have predictable workloads.

We recommend a phased approach:

Phase 1: Model PoC

Optional if you already know the model you want. Use managed services like AWS Bedrock to find the sweet spot between model size and reasoning capability. The open-weight model catalogue is expanding fast and the setup is minimal.

Phase 2: Cloud PoC

Use cloud-managed Kubernetes to prototype your multi-model infrastructure. Test different GPU offerings, benchmark your setup, and figure out your TPM and RPM requirements. Test your open-source model choices without locking into expensive hardware early.

Phase 3: On-premise refinement

Once you understand your patterns and limits, modify your existing Kubernetes cluster for an on-premise deployment. This is significantly easier than starting here from scratch.

Choosing the right model

To simplify the equation: the two factors that drive infrastructure cost are model size (parameter count) and model context (the active memory window containing your conversation and retrieved data).
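A back-of-envelope sketch of why these two factors dominate: weight memory scales with parameter count, and KV-cache memory scales with context length. The model shape below is a rough Llama-3-8B-like assumption (32 layers, 8 KV heads, head dimension 128), not exact figures for any specific checkpoint:

```python
def weights_gib(params_b: float, bytes_per_param: int = 2) -> float:
    # FP16/BF16 weights: parameter count x 2 bytes each
    return params_b * 1e9 * bytes_per_param / 2**30


def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_val: int = 2) -> float:
    # Two tensors (K and V) per layer, per token
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_val / 2**30


# An 8B model in FP16: ~15 GiB of weights before any runtime overhead.
w = weights_gib(8)
# A single 20k-token context on the assumed shape: a few extra GiB of KV cache.
kv = kv_cache_gib(32, 8, 128, 20_000)
```

Multiply the KV-cache figure by concurrent requests and the cost driver becomes obvious: context, not just parameters, fills your VRAM.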

The LLM is your main challenge. Embedding and reranking models require comparatively little GPU power.

Here are three scenarios to illustrate the range. Note that these are rough on-demand estimates, and be sure to check current pricing and consider reserved or spot instances where applicable.

Small: chatbot with basic interactions

Customer support, simple Q&A. No complex reasoning or large context required.

  • Size: 7B or 8B parameters
  • Context: 2k–20k tokens
  • OSS models: Llama 3 (8B), Mistral (7B), Qwen (7B)
  • Proprietary use case equivalents: GPT-4o-mini, Claude Haiku, Gemini Flash-Lite
  • Infrastructure: A single G6e family instance
  • Monthly cost: ~$400–$600

Medium: reasoning over a knowledge base

Internal knowledge bases where the model reads retrieved company documents, follows strict instructions, and needs to minimize hallucinations.

  • Size: 70B parameters
  • Context: 20k–50k tokens
  • OSS models: Llama 3 (70B), Mixtral (8x7B), Qwen (72B), GPT-OSS-20B
  • Proprietary use case equivalents: Claude Sonnet, Gemini Flash
  • Infrastructure: Multi-GPU setup
  • Monthly cost: ~$3k–$8k

Large: high accuracy, high reasoning, high context

Complex code refactoring, massive document analysis, predictions, and advanced agents. Maximum accuracy and minimal hallucinations are non-negotiable.

  • Size: 100B+ parameters
  • Context: 50k+ tokens
  • OSS models: GPT-OSS-120B, DeepSeek-R1, Mistral Large 3
  • Proprietary use case equivalents: GPT-5, Claude Opus, Gemini Pro
  • Infrastructure: p5e.48xlarge instances (8×H200)
  • Monthly cost: ~$30k+

These are rough single-environment estimates. Multi-environment, highly available enterprise setups multiply these figures quickly.
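For rough planning, the arithmetic is simple: an always-on on-demand instance runs about 730 hours a month, and every extra replica and environment multiplies the bill. The $0.75/hour rate below is purely hypothetical:

```python
HOURS_PER_MONTH = 730  # approximate always-on hours


def monthly_cost(hourly_rate: float, instances: int = 1, environments: int = 1) -> float:
    # On-demand, always-on estimate; reserved or spot pricing will be lower
    return hourly_rate * HOURS_PER_MONTH * instances * environments


# Hypothetical $0.75/hr single-GPU instance, one environment:
single = monthly_cost(0.75)
# The same instance as an HA pair across dev/staging/prod: 6x the single figure
enterprise = monthly_cost(0.75, instances=2, environments=3)
```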

Benchmarking your setup

Although there are fast general benchmarking tools available, like LLMfit, you should measure model performance in your own environment. This also reveals hardware traps that generic benchmarks won’t surface.

For example, adding more L40S GPUs may not increase performance. These GPUs communicate over the PCIe bus instead of NVLink, and the communication overhead can cancel out the compute gains.

vLLM has a native benchmarking option via the vllm bench serve command. The key metrics to watch:

  • Median TTFT (Time to First Token): how long from prompt submission to the first generated token; the user’s perceived responsiveness.
  • Median TPOT (Time Per Output Token): how long each subsequent token takes to generate.
  • Median ITL (Inter-Token Latency): the gap between consecutive tokens; the smoothness of streaming output.
  • Output token throughput: tokens generated per second across all concurrent users.
  • Total token throughput: combined rate for both prompt processing and generation.
  • Request throughput: complete requests resolved per second.
  • Max request concurrency: peak number of simultaneous requests handled during the test.
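To make the latency metrics concrete, here is a toy illustration of how TTFT, TPOT, and ITL are derived from token arrival timestamps (this is how the metrics are defined, not vLLM's implementation):

```python
from statistics import median


def latency_metrics(submit_t: float, token_times: list[float]) -> dict:
    # token_times: wall-clock timestamps at which each output token arrived
    ttft = token_times[0] - submit_t                      # wait for first token
    itls = [b - a for a, b in zip(token_times, token_times[1:])]
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return {"ttft": ttft, "median_itl": median(itls), "tpot": tpot}


# A request submitted at t=0.0, first token at 0.5s, then one token every 40ms:
times = [0.5 + 0.04 * i for i in range(20)]
m = latency_metrics(0.0, times)
```

A setup can have excellent throughput yet a poor median ITL, which users experience as jittery streaming; that is why the table tracks both.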

Optimization techniques

There is extensive documentation on optimization techniques. Here’s a summary of those that made the biggest difference for us.

Quantization

Reduces weight precision (e.g., from 16-bit to 8-bit or 4-bit) to shrink the model’s memory footprint. This has a direct impact on which models you can fit on your available hardware.
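The memory math is straightforward. A quick sketch of weight memory at different precisions (weights only; runtime overhead and KV cache come on top):

```python
def model_gib(params_b: float, bits: int) -> float:
    # Weight memory: parameter count x bits per weight, converted to GiB
    return params_b * 1e9 * bits / 8 / 2**30


# A 70B model: ~130 GiB at 16-bit, ~65 GiB at 8-bit, ~33 GiB at 4-bit.
sizes = {bits: model_gib(70, bits) for bits in (16, 8, 4)}
```

In other words, a 70B model does not fit on a single 80 GB GPU at 16-bit precision, but its 8-bit variant does, which is exactly the kind of decision this technique unlocks.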

Automatic prefix caching

Worker/node-level memory management that caches the KV state of previous queries. If you’re querying the same long document multiple times, the document is processed once and subsequent queries pull from cache. The result is higher throughput and lower latency.
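A toy sketch of the block-level idea (illustrative only; vLLM caches fixed-size token blocks of real KV tensors in GPU memory, and the block size here is an assumption):

```python
import hashlib

BLOCK = 16  # tokens per cache block (assumed)


class PrefixCache:
    def __init__(self):
        self.store = {}  # chained block hash -> computed state (stub values here)
        self.hits = 0
        self.misses = 0

    def process(self, tokens: list[int]) -> None:
        # Hash blocks cumulatively so a block's key encodes everything before it;
        # identical prefixes therefore produce identical key chains.
        h = hashlib.sha256()
        for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
            h.update(str(tokens[i:i + BLOCK]).encode())
            key = h.hexdigest()
            if key in self.store:
                self.hits += 1
            else:
                self.misses += 1
                self.store[key] = f"kv-state-{key[:8]}"  # stand-in for KV tensors


cache = PrefixCache()
doc = list(range(64))              # a "long document" of 64 tokens
cache.process(doc + [100, 101])    # first query: all document blocks computed
cache.process(doc + [200, 201])    # second query: document blocks hit the cache
```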

Distributed caching via LMCache

Automatic prefix caching is limited to a single worker’s GPU VRAM — extremely fast, but expensive. LMCache enables cluster-wide offloading to cheaper storage (CPU memory, disk, or Redis) at the cost of some latency. Use both in a tiered memory hierarchy for the best balance.
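The tiered hierarchy behaves like any multi-level cache. A minimal sketch of the lookup-and-promote pattern that LMCache enables (all names and the tiny capacity here are invented; real tiers hold KV tensors, not strings):

```python
gpu_tier: dict = {}  # small and fast (GPU VRAM)
cpu_tier: dict = {}  # large and slower (CPU memory, disk, or Redis)
GPU_CAPACITY = 2     # artificially tiny to show eviction


def promote(key, value):
    # Make room in the fast tier by demoting its oldest entry to the slow tier
    if len(gpu_tier) >= GPU_CAPACITY:
        evicted, val = next(iter(gpu_tier.items()))
        del gpu_tier[evicted]
        cpu_tier[evicted] = val
    gpu_tier[key] = value


def get(key):
    if key in gpu_tier:
        return gpu_tier[key], "gpu"      # fastest path
    if key in cpu_tier:
        promote(key, cpu_tier[key])      # pay some latency, warm the fast tier
        return cpu_tier[key], "cpu"
    return None, "miss"                  # full recompute needed


for k in ("a", "b", "c"):
    promote(k, f"kv-{k}")                # "a" gets demoted when "c" arrives
```

The payoff: a cache entry evicted from VRAM costs a slower fetch instead of a full recompute of the prefill.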

Tensor parallelism

Distributes the workload by splitting tensors across multiple GPUs. Effectively a requirement for larger models. Performance depends heavily on fast interconnects like NVLink.

Speculative decoding

There are multiple methods, but a common approach pairs a large model with a tiny, fast draft model. The draft model guesses the next tokens, and the large model verifies them in a single pass. This multiplies token generation speed.
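A toy version of the draft-and-verify loop (both "models" are stand-in functions; real speculative decoding verifies token probabilities, not exact strings):

```python
def draft_model(context: list[str], k: int) -> list[str]:
    # A cheap model that is right most of the time
    canned = ["the", "cat", "sat", "on", "a", "mat"]
    return canned[len(context):len(context) + k]


def target_model(context: list[str], proposal: list[str]) -> list[str]:
    # The expensive model checks all proposed tokens in one pass,
    # keeping the accepted run and correcting the first mistake.
    truth = ["the", "cat", "sat", "on", "the", "mat"]
    accepted = []
    for i, tok in enumerate(proposal):
        if truth[len(context) + i] == tok:
            accepted.append(tok)
        else:
            accepted.append(truth[len(context) + i])  # correct it and stop
            break
    return accepted


context: list[str] = []
calls = 0
while len(context) < 6:
    proposal = draft_model(context, k=3)   # cheap: guess up to 3 tokens ahead
    context += target_model(context, proposal)  # expensive: one verify pass
    calls += 1
```

Here the expensive model runs 3 times instead of 6, because each verify pass accepts multiple draft tokens; that ratio is where the speedup comes from.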

Prefill/decode disaggregation

Separates the prefill and decode stages onto different GPUs or nodes. Since the two tasks have different computational profiles (compute-bound vs. memory-bandwidth-bound), you can scale each independently — either to improve responsiveness or to prevent long prompts from stalling active generation.

What’s ahead for self-hosting

There will always be demand for on-premise self-hosted AI in systems that require maximum control over their data.

The barrier to entry is dropping. Inference engines are maturing, optimization techniques are compounding, and models are getting better with fewer parameters and lower VRAM requirements. The recent Gemma 4 release is a good example: judging by the benchmarks, it delivers strong performance for a modest hardware investment. Stay tuned for a deep dive on that one.

In conclusion, enterprise-grade self-hosting remains expensive, but the trajectory is clear: organizations will be able to do significantly more with significantly less hardware.

The phased approach we outlined here is designed to let you start proving value now, without committing to infrastructure you don’t yet understand.