<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/">
	<channel>
		<title>Author at Infinum</title>
		<atom:link href="https://infinum.com/blog/author/vjekoslav-draksic/feed/" rel="self" type="application/rss+xml" />
		<link></link>
		<description>Building digital products</description>
		<lastBuildDate>Fri, 17 Apr 2026 13:59:15 +0000</lastBuildDate>
		<sy:updatePeriod>hourly</sy:updatePeriod>
		<sy:updateFrequency>1</sy:updateFrequency>

					<item>
				<image>
					<url>https://infinum.com/uploads/2026/04/img-hero-selfhosting-llm.webp</url>
				</image>
				<title>Self-Hosting AI Models: A Practical Guide to Building Your Own Stack</title>
				<link>https://infinum.com/blog/self-hosting-ai-models-a-practical-guide/</link>
				<pubDate>Thu, 16 Apr 2026 16:16:50 +0000</pubDate>
				<dc:creator>Vjekoslav Drakšić</dc:creator>
				<guid isPermaLink="false">https://infinum.com/?p=19278544</guid>
				<description>
					<![CDATA[<p>Infrastructure decisions, model selection tradeoffs, and performance optimization techniques we encountered while building a self-hosted multi-model inference stack. </p>
<p>The post <a href="https://infinum.com/blog/self-hosting-ai-models-a-practical-guide/">Self-Hosting AI Models: A Practical Guide to Building Your Own Stack</a> appeared first on <a href="https://infinum.com">Infinum</a>.</p>
]]>
				</description>
				<content:encoded>
					<![CDATA[<div
	class="wrapper"
	data-id="es-293"
	 data-animation-target='inner-items'>
		
			<div class="wrapper__inner">
			<div class="block-blog-content js-block-blog-content">
	
<div class="block-blog-content-sidebar" data-id="es-92">
	</div>

<div class="block-blog-content-main">
	
<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-95"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-93">
	<p	class='typography typography--size-36-text js-typography block-typography__typography'
	data-id='es-94'
	>
	Many organizations now want full ownership of their <a href="https://infinum.com/artificial-intelligence/" id="https://infinum.com/artificial-intelligence/">AI infrastructure</a>. The motivation for self-hosting ranges from data ownership requirements and contractual obligations to maintaining the highest level of system security.</p></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-98"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-96">
	<p	class='typography typography--size-16-text-roman js-typography block-typography__typography'
	data-id='es-97'
	>
	This post covers the infrastructure decisions, model selection tradeoffs, and performance optimization techniques we encountered while building a self-hosted multi-model inference stack. <a href="https://infinum.com/blog/ai-generated-code-security-risks/" id="https://infinum.com/blog/ai-generated-code-security-risks/">Security architecture</a> and model licensing are out of scope here as both deserve their own deep dives. Still, everything about building the infrastructure and making it perform is fair game.</p></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-101"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-99">
	<p	class='typography typography--size-16-text-roman js-typography block-typography__typography'
	data-id='es-100'
	>
	Here&#8217;s what our stack looked like:</p></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-104"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="lists" data-id="es-102">
	<ul	class='typography typography--size-16-text-roman js-typography lists__typography'
	data-id='es-103'
	>
	<li>An open-source inference engine (<a href="https://github.com/vllm-project/production-stack">vLLM production stack</a>)</li><li>A multi open-weight model setup</li><li>Accelerated computing instances on AWS</li><li>A scalable, highly available EKS cluster</li></ul></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-107"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-105">
	<h2	class='typography typography--size-52-default js-typography block-typography__typography'
	data-id='es-106'
	>
	vLLM as an inference engine</h2></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-110"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-108">
	<p	class='typography typography--size-16-text-roman js-typography block-typography__typography'
	data-id='es-109'
	>
	There are several open-source inference engines to choose from, including <a href="https://github.com/InternLM/lmdeploy">LMDeploy</a>, <a href="https://github.com/sgl-project/sglang">SGLang</a>, and <a href="https://github.com/NVIDIA/TensorRT-LLM">TensorRT-LLM</a>.</p></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-113"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-111">
	<p	class='typography typography--size-16-text-roman js-typography block-typography__typography'
	data-id='es-112'
	>
	We chose vLLM for its performance, <a href="https://docs.vllm.ai/en/latest/models/supported_models/">broad model support</a>, <a href="https://docs.vllm.ai/en/latest/">extensive documentation</a>, and built-in multi-model type routing. </p></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-116"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-114">
	<p	class='typography typography--size-16-text-roman js-typography block-typography__typography'
	data-id='es-115'
	>
	Their production stack ships with an infrastructure diagram you can extend for your own setup, but the core components are:</p></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-120"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="bullet bullet--left bullet__type--dot bullet__color--black block-bullet__bullet" data-id="es-117">
			<div class="bullet__dot"></div>
		<div class="bullet__content">
		<p	class='typography typography--size-24-text js-typography bullet__heading'
	data-id='es-118'
	>
	<strong>Request router</strong></p><p	class='typography typography--size-18-text-roman js-typography bullet__paragraph'
	data-id='es-119'
	>
	An OpenAI-compatible API layer. It uses prefix-aware routing to direct repeat context to the same worker, reducing time to first token. In a multi-model setup, the router handles requests by endpoint, model name, and worker assignment.</p>	</div>
</div>	</div>
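To make the prefix-aware idea concrete, here is a toy routing sketch. This is illustrative only, not the production router's actual algorithm: the worker names and the character-based "prefix" are made up, and a real router works on token blocks.

```python
import hashlib

# Toy prefix-aware routing sketch: requests whose prompts share a prefix
# are pinned to the same worker, so that worker's warm KV cache is reused.
WORKERS = ["worker-0", "worker-1", "worker-2"]  # hypothetical worker IDs
PREFIX_LEN = 64  # route on the first N characters, a stand-in for token blocks

def pick_worker(prompt: str) -> str:
    # Hash only the prefix, so prompts that start the same land together.
    digest = hashlib.sha256(prompt[:PREFIX_LEN].encode()).digest()
    return WORKERS[digest[0] % len(WORKERS)]

# Two requests sharing a long system prompt hit the same worker:
system = "You are a helpful assistant for internal documentation. " * 3
same = pick_worker(system + "Question A") == pick_worker(system + "Question B")
```

The payoff is that the second request's shared prefix is already in that worker's cache, which is exactly the time-to-first-token win described above.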

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-124"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="bullet bullet--left bullet__type--dot bullet__color--black block-bullet__bullet" data-id="es-121">
			<div class="bullet__dot"></div>
		<div class="bullet__content">
		<p	class='typography typography--size-24-text js-typography bullet__heading'
	data-id='es-122'
	>
	<strong>Workers</strong></p><p	class='typography typography--size-18-text-roman js-typography bullet__paragraph'
	data-id='es-123'
	>
	vLLM instances running on GPU nodes. The stack handles tensor parallelism across multiple GPUs for large models out of the box.</p>	</div>
</div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-128"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="bullet bullet--left bullet__type--dot bullet__color--black block-bullet__bullet" data-id="es-125">
			<div class="bullet__dot"></div>
		<div class="bullet__content">
		<p	class='typography typography--size-24-text js-typography bullet__heading'
	data-id='es-126'
	>
	<strong>KV cache storage</strong></p><p	class='typography typography--size-18-text-roman js-typography bullet__paragraph'
	data-id='es-127'
	>
	In a multi-worker setup, previously computed state is retrieved from <a href="https://docs.lmcache.ai/">LMCache</a>, which delivers significant performance gains, especially for models like <a href="https://blog.lmcache.ai/en/2025/08/05/lmcache-supports-gpt-oss-20b-120b-on-day-1/">GPT OSS</a>.</p>	</div>
</div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-132"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="bullet bullet--left bullet__type--dot bullet__color--black block-bullet__bullet" data-id="es-129">
			<div class="bullet__dot"></div>
		<div class="bullet__content">
		<p	class='typography typography--size-24-text js-typography bullet__heading'
	data-id='es-130'
	>
	<strong>Observability stack</strong></p><p	class='typography typography--size-18-text-roman js-typography bullet__paragraph'
	data-id='es-131'
	>
	Prometheus and Grafana for monitoring.</p>	</div>
</div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-135"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-133">
	<h3	class='typography typography--size-30-text js-typography block-typography__typography'
	data-id='es-134'
	>
	Simplified end-to-end flow</h3></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-138"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-136">
	<p	class='typography typography--size-16-text-roman js-typography block-typography__typography'
	data-id='es-137'
	>
	The request router analyzes the incoming prompt&#8217;s prefix and directs it to a worker that already holds that context in memory. The worker processes the request with optimized block-based memory management, pulling previously computed states from a per-node or cluster-wide cache, and generates the response.</p></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-141"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-139">
	<h2	class='typography typography--size-52-default js-typography block-typography__typography'
	data-id='es-140'
	>
	Choosing the right hosting environment</h2></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-144"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-142">
	<p	class='typography typography--size-16-text-roman js-typography block-typography__typography'
	data-id='es-143'
	>
	The AI hosting landscape is competitive. The vLLM production stack has <a href="https://github.com/vllm-project/production-stack/tree/main/deployment_on_cloud">cloud deployment support</a> for AWS, Azure, and GCP, and project velocity matters a lot at this stage.</p></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-147"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-145">
	<p	class='typography typography--size-16-text-roman js-typography block-typography__typography'
	data-id='es-146'
	>
	This is why we chose AWS EKS. The cost savings from alternative providers did not justify the increased setup complexity.</p></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-150"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-148">
	<p	class='typography typography--size-16-text-roman js-typography block-typography__typography'
	data-id='es-149'
	>
	Specialty cloud hosting providers are cheaper, but they often offer unmanaged environments. That means you handle all the heavy lifting yourself: networking, orchestration, GPU scheduling, the lot.</p></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-153"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-151">
	<h3	class='typography typography--size-36-text js-typography block-typography__typography'
	data-id='es-152'
	>
	On-premise considerations</h3></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-156"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-154">
	<p	class='typography typography--size-16-text-roman js-typography block-typography__typography'
	data-id='es-155'
	>
	Buying hardware immediately is an operational risk, even if you have predictable workloads.</p></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-159"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-157">
	<p	class='typography typography--size-16-text-roman js-typography block-typography__typography'
	data-id='es-158'
	>
	<strong>We recommend a phased approach:</strong></p></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-163"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="bullet bullet--left bullet__type--dot bullet__color--black block-bullet__bullet" data-id="es-160">
			<div class="bullet__dot"></div>
		<div class="bullet__content">
		<h4	class='typography typography--size-24-text js-typography bullet__heading'
	data-id='es-161'
	>
	<strong>Phase 1: Model PoC</strong></h4><p	class='typography typography--size-20-text-roman js-typography bullet__paragraph'
	data-id='es-162'
	>
	Optional if you already know the model you want. Use managed services like AWS Bedrock to find the sweet spot between model size and reasoning capability. The open-weight model catalogue is expanding fast and the setup is minimal.</p>	</div>
</div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-165"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<hr
	class="block-divider"
	data-id="es-164"
	aria-hidden="true" />	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-169"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="bullet bullet--left bullet__type--dot bullet__color--black block-bullet__bullet" data-id="es-166">
			<div class="bullet__dot"></div>
		<div class="bullet__content">
		<h4	class='typography typography--size-24-text js-typography bullet__heading'
	data-id='es-167'
	>
	<strong>Phase 2: Cloud PoC</strong></h4><p	class='typography typography--size-20-text-roman js-typography bullet__paragraph'
	data-id='es-168'
	>
	Use cloud-managed Kubernetes to prototype your multi-model infrastructure. Test different GPU offerings, benchmark your setup, and figure out your TPM and RPM requirements. Test your open-source model choices without locking into expensive hardware early.</p>	</div>
</div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-171"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<hr
	class="block-divider"
	data-id="es-170"
	aria-hidden="true" />	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-175"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="bullet bullet--left bullet__type--dot bullet__color--black block-bullet__bullet" data-id="es-172">
			<div class="bullet__dot"></div>
		<div class="bullet__content">
		<h4	class='typography typography--size-24-text js-typography bullet__heading'
	data-id='es-173'
	>
	<strong>Phase 3: On-premise refinement</strong></h4><p	class='typography typography--size-20-text-roman js-typography bullet__paragraph'
	data-id='es-174'
	>
	Once you understand your patterns and limits, modify your existing Kubernetes cluster for an on-premise deployment. This is significantly easier than starting here from scratch.</p>	</div>
</div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-178"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-176">
	<h2	class='typography typography--size-52-default js-typography block-typography__typography'
	data-id='es-177'
	>
	Choosing the right model</h2></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-181"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-179">
	<p	class='typography typography--size-16-text-roman js-typography block-typography__typography'
	data-id='es-180'
	>
	To simplify the equation: the two factors that drive infrastructure cost are <strong>model size</strong> (parameter count) and <strong>model context</strong> (the active memory window containing your conversation and retrieved data).</p></div>	</div>
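A back-of-the-envelope calculation shows how those two factors translate into GPU memory. The architecture numbers below are illustrative (roughly in line with a Llama-3-70B-class model); check your model's actual config before sizing hardware.

```python
# Rough GPU memory estimate for the two cost drivers: weights and context.

def weights_gb(params_b: float, bytes_per_param: int = 2) -> float:
    """Model weights: parameter count x precision (2 bytes for fp16/bf16)."""
    return params_b * 1e9 * bytes_per_param / 1e9

def kv_cache_gb(tokens: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_val: int = 2) -> float:
    """KV cache: 2 (K and V) x layers x kv_heads x head_dim bytes per token.
    Defaults are illustrative, Llama-3-70B-like values."""
    return 2 * layers * kv_heads * head_dim * bytes_per_val * tokens / 1e9

# A 70B model at bf16 with a 50k-token context window:
total = weights_gb(70) + kv_cache_gb(50_000)  # weights dominate, context adds up
```

Note how the context term grows linearly with tokens and with concurrent users, which is why long-context workloads push you into the larger hardware tiers below.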

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-184"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-182">
	<p	class='typography typography--size-16-text-roman js-typography block-typography__typography'
	data-id='es-183'
	>
	The LLM is your main challenge. Embedding and reranking models require comparatively little GPU power.</p></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-187"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-185">
	<p	class='typography typography--size-16-text-roman js-typography block-typography__typography'
	data-id='es-186'
	>
	Here are three scenarios to illustrate the range. Note that these are rough on-demand estimates, and be sure to check current pricing and consider reserved or spot instances where applicable.</p></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-190"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-188">
	<h3	class='typography typography--size-36-text js-typography block-typography__typography'
	data-id='es-189'
	>
	Small: chatbot with basic interactions</h3></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-193"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-191">
	<p	class='typography typography--size-16-text-roman js-typography block-typography__typography'
	data-id='es-192'
	>
	Customer support, simple Q&amp;A. No complex reasoning or large context required.</p></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-196"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="lists" data-id="es-194">
	<ul	class='typography typography--size-16-text-roman js-typography lists__typography'
	data-id='es-195'
	>
	<li><strong>Size:</strong> 7B or 8B parameters</li><li><strong>Context:</strong> 2k–20k tokens</li><li><strong>OSS models:</strong> Llama 3 (8B), Mistral (7B), Qwen (7B)</li><li><strong>Proprietary use case equivalents:</strong> GPT-4o-mini, Claude Haiku, Gemini Flash-Lite</li><li><strong>Infrastructure:</strong> A single G6e family instance</li><li><strong>Monthly cost:</strong> ~$400–$600</li></ul></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-198"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<hr
	class="block-divider"
	data-id="es-197"
	aria-hidden="true" />	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-201"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-199">
	<h3	class='typography typography--size-36-text js-typography block-typography__typography'
	data-id='es-200'
	>
	Medium: reasoning over a knowledge base</h3></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-204"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-202">
	<p	class='typography typography--size-16-text-roman js-typography block-typography__typography'
	data-id='es-203'
	>
	Internal knowledge bases where the model reads retrieved company documents, follows strict instructions, and needs to minimize hallucinations.</p></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-207"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="lists" data-id="es-205">
	<ul	class='typography typography--size-16-text-roman js-typography lists__typography'
	data-id='es-206'
	>
	<li><strong>Size:</strong> 70B parameters</li><li><strong>Context:</strong> 20k–50k tokens</li><li><strong>OSS models:</strong> Llama 3 (70B), Mixtral (8x7B), Qwen (72B), GPT-OSS-20B</li><li><strong>Proprietary use case equivalents:</strong> Claude Sonnet, Gemini Flash</li><li><strong>Infrastructure:</strong> Multi-GPU setup</li><li><strong>Monthly cost:</strong> ~$3k–$8k</li></ul></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-209"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<hr
	class="block-divider"
	data-id="es-208"
	aria-hidden="true" />	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-212"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-210">
	<h3	class='typography typography--size-36-text js-typography block-typography__typography'
	data-id='es-211'
	>
	Large: high accuracy, high reasoning, high context</h3></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-215"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-213">
	<p	class='typography typography--size-16-text-roman js-typography block-typography__typography'
	data-id='es-214'
	>
	Complex code refactoring, massive document analysis, predictions, and advanced agents. Maximum accuracy and minimal hallucinations are non-negotiable.</p></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-218"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="lists" data-id="es-216">
	<ul	class='typography typography--size-16-text-roman js-typography lists__typography'
	data-id='es-217'
	>
	<li><strong>Size:</strong> 100B+ parameters</li><li><strong>Context:</strong> 50k+ tokens</li><li><strong>OSS models:</strong> GPT-OSS-120B, DeepSeek-R1, Mistral Large 3</li><li><strong>Proprietary use case equivalents:</strong> GPT-5, Claude Opus, Gemini Pro</li><li><strong>Infrastructure:</strong> p5e.48xlarge instances (8×H200)</li><li><strong>Monthly cost:</strong> ~$30k+</li></ul></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-221"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-219">
	<p	class='typography typography--size-16-text-roman js-typography block-typography__typography'
	data-id='es-220'
	>
	<strong>These are rough single-environment estimates.</strong> Multi-environment, highly available enterprise setups multiply these figures quickly.</p></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-224"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-222">
	<h2	class='typography typography--size-52-default js-typography block-typography__typography'
	data-id='es-223'
	>
	Benchmarking your setup</h2></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-227"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-225">
	<p	class='typography typography--size-16-text-roman js-typography block-typography__typography'
	data-id='es-226'
	>
	Although there are fast general benchmarking tools available, like <a href="https://github.com/AlexsJones/llmfit">LLMfit</a>, you should measure model performance in your own environment. This also reveals hardware traps that generic benchmarks won&#8217;t surface.</p></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-230"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-228">
	<p	class='typography typography--size-16-text-roman js-typography block-typography__typography'
	data-id='es-229'
	>
	For example, adding more L40S GPUs may not increase performance. These GPUs communicate over the PCIe bus instead of NVLink, and the communication overhead can cancel out the compute gains.</p></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-233"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-231">
	<p	class='typography typography--size-16-text-roman js-typography block-typography__typography'
	data-id='es-232'
	>
	vLLM has a native benchmarking option via the <a href="https://docs.vllm.ai/en/latest/cli/bench/serve/">bench serve</a> command. The key metrics to watch:</p></div>	</div>

<div
	class="wrapper"
	data-id="es-234"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="wrapper__inner">
			
<figure class="wp-block-table"><table><thead><tr><th>Metric</th><th>Meaning</th></tr></thead><tbody><tr><td><strong>Median TTFT</strong> (Time to First Token)</td><td>How long from prompt submission to the first generated token. The user&#8217;s perceived responsiveness.</td></tr><tr><td><strong>Median TPOT</strong> (Time Per Output Token)</td><td>How long each subsequent token takes to generate.</td></tr><tr><td><strong>Median ITL</strong> (Inter-Token Latency)</td><td>The gap between consecutive tokens. Smoothness of streaming output.</td></tr><tr><td><strong>Output token throughput</strong></td><td>Tokens generated per second across all concurrent users.</td></tr><tr><td><strong>Total token throughput</strong></td><td>Combined rate for both prompt processing and generation.</td></tr><tr><td><strong>Request throughput</strong></td><td>Complete requests resolved per second.</td></tr><tr><td><strong>Max request concurrency</strong></td><td>Peak number of simultaneous requests handled during the test.</td></tr></tbody></table></figure>
		</div>
	</div>
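The latency metrics in the table are simple to derive yourself from a stream of token arrival times. This is illustrative post-processing over hypothetical timestamps, not vLLM's internal implementation:

```python
import statistics

def latency_metrics(token_times: list[float]) -> dict[str, float]:
    """token_times: seconds from request submission to each token's arrival."""
    ttft = token_times[0]                         # time to first token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    return {
        "ttft": ttft,
        # avg time per output token, measured after the first token:
        "tpot": (token_times[-1] - ttft) / (len(token_times) - 1),
        "itl_median": statistics.median(gaps),    # inter-token latency
    }

m = latency_metrics([0.25, 0.30, 0.34, 0.40, 0.44])
```

Running the same calculation over every request in a load test gives you the medians that `bench serve` reports.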

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-237"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-235">
	<h2	class='typography typography--size-52-default js-typography block-typography__typography'
	data-id='es-236'
	>
	Optimization techniques</h2></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-240"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-238">
	<p	class='typography typography--size-16-text-roman js-typography block-typography__typography'
	data-id='es-239'
	>
	There is extensive documentation on optimization techniques. Here&#8217;s a summary of those that made the biggest difference for us.</p></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-243"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-241">
	<h3	class='typography typography--size-36-text js-typography block-typography__typography'
	data-id='es-242'
	>
	<a href="https://docs.vllm.ai/en/latest/features/quantization/">Quantization</a></h3></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-246"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-244">
	<p	class='typography typography--size-16-text-roman js-typography block-typography__typography'
	data-id='es-245'
	>
	Reduces weight precision (e.g., from 16-bit to 8-bit or 4-bit) to shrink the model&#8217;s memory footprint. This has a direct impact on what model you can fit on your available hardware.</p></div>	</div>
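The memory math is straightforward, using the rule of thumb that footprint scales with bytes per weight. Real quantized checkpoints add small overheads for scales and zero-points, so treat these as lower bounds:

```python
def footprint_gb(params_b: float, bits: int) -> float:
    """Approximate weight memory: parameters x bits per weight / 8."""
    return params_b * 1e9 * bits / 8 / 1e9

fp16 = footprint_gb(70, 16)  # 140 GB - needs multiple GPUs
int8 = footprint_gb(70, 8)   # 70 GB
int4 = footprint_gb(70, 4)   # 35 GB - weights fit a single 48 GB card,
                             # before accounting for KV cache
```

Whether the quality loss at 4-bit is acceptable depends on the model and the task, which is another reason to benchmark in your own environment.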

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-249"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-247">
	<h3	class='typography typography--size-36-text js-typography block-typography__typography'
	data-id='es-248'
	>
	<a href="https://docs.vllm.ai/en/latest/features/automatic_prefix_caching/#introduction">Automatic prefix caching</a></h3></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-252"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-250">
	<p	class='typography typography--size-16-text-roman js-typography block-typography__typography'
	data-id='es-251'
	>
	Worker/Node level memory management. Caches the KV state of existing queries. If you&#8217;re querying the same long document multiple times, the document is processed once and subsequent queries pull from cache. The result is higher throughput and lower latency.</p></div>	</div>
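A toy version of the mechanism looks like this. vLLM does this per KV block on the GPU; here plain dict entries stand in for the cached KV tensors, and the 16-token block size matches vLLM's default:

```python
import hashlib

BLOCK = 16                     # tokens per cache block
cache: dict[str, str] = {}
prefill_calls = 0              # counts blocks actually computed

def prefill(tokens: list[int]) -> None:
    """Compute KV state block by block, skipping blocks already cached."""
    global prefill_calls
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        # A block's key covers the whole prefix up to it, so a cache hit
        # is only valid when everything before the block matches too.
        key = hashlib.sha256(str(tokens[: i + BLOCK]).encode()).hexdigest()
        if key not in cache:
            cache[key] = "kv-state"   # stand-in for computed KV tensors
            prefill_calls += 1

doc = list(range(64))          # a "long document" of 64 tokens
prefill(doc + [1000])          # first query: 4 blocks computed
prefill(doc + [2000])          # second query: all 4 document blocks hit cache
```

The second query only pays for its own new suffix, which is where the throughput and latency gains come from.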

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-255"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-253">
	<h3	class='typography typography--size-36-text js-typography block-typography__typography'
	data-id='es-254'
	>
	Distributed caching via <a href="https://lmcache.ai/">LMCache</a></h3></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-258"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-256">
	<p	class='typography typography--size-16-text-roman js-typography block-typography__typography'
	data-id='es-257'
	>
	Automatic prefix caching is limited to a single worker&#8217;s GPU VRAM — extremely fast, but expensive. LMCache enables cluster-wide offloading to cheaper storage (CPU memory, disk, or Redis) at the cost of some latency. Use both in a tiered memory hierarchy for the best balance.</p></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-261"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-259">
	<h3	class='typography typography--size-36-text js-typography block-typography__typography'
	data-id='es-260'
	>
	<a href="https://docs.vllm.ai/en/stable/serving/parallelism_scaling/">Tensor parallelism</a></h3></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-264"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-262">
	<p	class='typography typography--size-16-text-roman js-typography block-typography__typography'
	data-id='es-263'
	>
	Workload distribution. Splits tensors across multiple GPUs. Effectively a requirement for larger models. Performance depends heavily on fast interconnects like NVLink.</p></div>	</div>
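With vLLM's offline Python API, enabling tensor parallelism is a single argument. The model name below is just an example; pick `tensor_parallel_size` to match the GPUs on one node:

```python
from vllm import LLM

# Shards the model's tensors across 4 GPUs on a single node.
llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", tensor_parallel_size=4)
```

Because every forward pass synchronizes across the shards, this is the setting where the NVLink-vs-PCIe difference mentioned earlier shows up most clearly.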

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-267"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-265">
	<h3	class='typography typography--size-36-text js-typography block-typography__typography'
	data-id='es-266'
	>
	<a href="https://docs.vllm.ai/en/latest/features/speculative_decoding/">Speculative decoding</a></h3></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-270"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-268">
	<p	class='typography typography--size-16-text-roman js-typography block-typography__typography'
	data-id='es-269'
	>
	There are multiple methods. One approach pairs a large model with a tiny, fast model. The fast model guesses the next tokens, and the large model verifies them in a single pass. This can multiply token generation speed without changing the final output.</p></div>	</div>
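The draft-and-verify loop can be shown with a toy example. Both "models" here are plain functions over a fixed string; a real system verifies the draft's guesses in one batched forward pass of the large model rather than token by token:

```python
TARGET = "the quick brown fox jumps"   # stand-in for the large model's output

def target_next(prefix: str) -> str:   # slow, authoritative model
    return TARGET[len(prefix)]

def draft_guess(prefix: str, k: int) -> str:   # fast model, mostly right
    # Deliberately corrupts 'f' so we can see a rejected guess below.
    return TARGET[len(prefix): len(prefix) + k].replace("f", "?")

def speculate(prefix: str, k: int = 4) -> str:
    guess = draft_guess(prefix, k)
    for ch in guess:                   # verify guesses left to right
        if target_next(prefix) != ch:
            break                      # first mismatch: stop accepting
        prefix += ch
    else:
        return prefix                  # all k guesses accepted in one "pass"
    return prefix + target_next(prefix)  # target supplies the correct token
```

When the draft model is right, the large model effectively emits several tokens per verification step; when it is wrong, you still make one token of guaranteed-correct progress.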

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-273"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-271">
	<h3	class='typography typography--size-36-text js-typography block-typography__typography'
	data-id='es-272'
	>
	<a href="https://docs.vllm.ai/en/latest/features/disagg_prefill/">Disaggregated prefilling</a></h3></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-276"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-274">
	<p	class='typography typography--size-16-text-roman js-typography block-typography__typography'
	data-id='es-275'
	>
	Separates the prefill and decode stages onto different GPUs or nodes. Since the two tasks have different computational profiles (compute-bound vs. memory-bandwidth-bound), you can scale each independently — either to improve responsiveness or to prevent long prompts from stalling active generation.</p></div>	</div>
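A simple 1-prefill/1-decode split runs two instances, one acting as KV producer (prefill) and one as KV consumer (decode), with a proxy in front routing requests between them. Connector names and flags are version-specific and the model is an example, so treat this as a sketch of the shape, not a working config:

```shell
# Prefill instance: computes the prompt's KV cache and ships it out.
CUDA_VISIBLE_DEVICES=0 vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8100 \
  --kv-transfer-config '{"kv_connector": "PyNcclConnector", "kv_role": "kv_producer", "kv_rank": 0, "kv_parallel_size": 2}' &

# Decode instance: receives the KV cache and generates tokens.
CUDA_VISIBLE_DEVICES=1 vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8200 \
  --kv-transfer-config '{"kv_connector": "PyNcclConnector", "kv_role": "kv_consumer", "kv_rank": 1, "kv_parallel_size": 2}' &
```

Because the two roles scale independently, you can add prefill instances when prompts are long and decode instances when generations are long.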

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-279"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-277">
	<h2	class='typography typography--size-52-default js-typography block-typography__typography'
	data-id='es-278'
	>
	What&#8217;s ahead for self-hosting</h2></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-282"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-280">
	<p	class='typography typography--size-16-text-roman js-typography block-typography__typography'
	data-id='es-281'
	>
	There will always be demand for on-premise self-hosted AI in systems that require maximum control over their data.</p></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-285"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-283">
	<p	class='typography typography--size-16-text-roman js-typography block-typography__typography'
	data-id='es-284'
	>
	The barrier to entry is dropping. Inference engines are maturing, optimization techniques are compounding, and models are getting better with fewer parameters and lower VRAM requirements. The recent Gemma 4 release is a good example: judging by the benchmarks, it delivers strong performance for a modest hardware investment. Stay tuned for a deep dive on that one.</p></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-288"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-286">
	<p	class='typography typography--size-16-text-roman js-typography block-typography__typography'
	data-id='es-287'
	>
	In conclusion, enterprise-grade self-hosting remains <strong>expensive, but the trajectory is clear: organizations will be able to do significantly more with <a href="https://ai.google.dev/gemma/docs/core">significantly less hardware</a>.</strong></p></div>	</div>

<div
	class="wrapper wrapper__use-simple--true"
	data-id="es-291"
	 data-animation='slideFade' data-animation-target='inner-items'>
		
			<div class="block-typography" data-id="es-289">
	<p	class='typography typography--size-16-text-roman js-typography block-typography__typography'
	data-id='es-290'
	>
	The phased approach we outlined here is designed to let you start proving value now, without committing to infrastructure you don&#8217;t yet understand.</p></div>	</div>
</div>
</div>		</div>
	</div><p>The post <a href="https://infinum.com/blog/self-hosting-ai-models-a-practical-guide/">Self-Hosting AI Models: A Practical Guide to Building Your Own Stack</a> appeared first on <a href="https://infinum.com">Infinum</a>.</p>
]]>
				</content:encoded>
			</item>
		
	</channel>
</rss>