Gemma 4: Half the GPUs, Most of the Quality

Self-hosting competitive AI just got significantly cheaper. A new open-weight model from Google delivers near-identical chat quality to models four times its size – at roughly $14/hr instead of $28/hr. We ran the benchmarks to show where it wins, where it doesn't, and which variant fits which workload.

Until recently, self-hosting a competitive open-weight model meant committing to significant GPU investments.

Running GPT-OSS 120B in production – at realistic context lengths and with headroom for concurrent users – means at least four H100s. Frontier models like DeepSeek V4-Pro and Qwen 3.5-397B need even more.

Gemma 4 shifts the math. 

We benchmarked the dense 31B against GPT-OSS 120B on identical hardware. At realistic workloads, Gemma 4 holds its own on chat at half the GPU count – roughly $14/hr versus $28/hr of GPU footprint.

The tradeoff is real and workload-specific, but for many production deployments, it’s a tradeoff worth making. 

What Gemma 4 actually is

Google released Gemma 4 in April 2026 under Apache 2.0. We focused on two variants: 

  • The dense Gemma 4 31B (30.7B parameters, 256k context)
  • The MoE Gemma 4 26B A4B (25.2B total, 3.8B active per token, 256k context)

How does Gemma 4 stack up against its peers on accuracy? Here’s the picture from openrouter.ai benchmarks:

TABLE 1

Model             | Active Params (Total Params) | Intelligence Index | Coding Index | Agentic Index
Qwen3.5-397B-A17B | 17B (397B)                   | 45.0               | 41.3         | 55.8
Nemotron 3 Super  | 12B (120B)                   | 43.8               | 39.5         | 49.2
Gemma 4 31B       | 30.7B (30.7B)                | 39.2               | 38.7         | 40.9
GPT-OSS 120B      | 5.1B (120B)                  | 33.3               | 28.6         | 37.9
Gemma 4 26B A4B   | 3.8B (26B)                   | 27.1               | 29.1         | 28.9

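Gemma 4 31B holds its own against open-weight models with roughly four to thirteen times its total parameter count: it outscores GPT-OSS 120B on all three indexes and trails only Qwen3.5-397B-A17B and Nemotron 3 Super.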

What we tested and why

We benchmarked Gemma 4 31B against GPT-OSS 120B on identical hardware to isolate the architectural difference. The setup:

  • Hardware – AWS p5.48xlarge (8x NVIDIA H100 80GB)
  • Runtime – vLLM. Gemma 4 31B at fp8, GPT-OSS 120B at mxfp4.
  • Tensor parallelism (the number of GPUs the model is sharded across) – TP=2, TP=4, and TP=8 for Gemma 4 31B. GPT-OSS 120B OOMs at TP=2, so we ran TP=4 and TP=8.
  • Context length – 32k for the main runs, 128k for additional long-context probes.
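As a rough illustration of that test matrix, the sketch below builds one `vllm serve` command per model and TP size. It is not the exact launch script we used: the Gemma 4 model ID is a hypothetical placeholder, the GPT-OSS ID and the choice to pass no quantization flag for it (on the assumption that the checkpoint already ships in mxfp4) should be checked against the model card, and 32,768 is our assumed reading of "32k".

```python
# Sketch only: model IDs and per-model settings below are assumptions.
MODELS = {
    "gemma-4-31b": {
        "model_id": "google/gemma-4-31b-it",  # hypothetical repository name
        "quantization": "fp8",
        "tp_sizes": [2, 4, 8],
    },
    "gpt-oss-120b": {
        "model_id": "openai/gpt-oss-120b",  # assumed repository name
        "quantization": None,  # assumed: weights already stored as mxfp4
        "tp_sizes": [4, 8],  # TP=2 ran out of memory in our runs
    },
}


def build_serve_command(model_id, tp_size, max_model_len=32768, quantization=None):
    """Return the argument list for one `vllm serve` launch."""
    cmd = [
        "vllm", "serve", model_id,
        "--tensor-parallel-size", str(tp_size),
        "--max-model-len", str(max_model_len),
    ]
    if quantization is not None:
        cmd += ["--quantization", quantization]
    return cmd


def all_commands(models=MODELS):
    """Expand the model table into one command per (model, TP size) pair."""
    commands = []
    for spec in models.values():
        for tp_size in spec["tp_sizes"]:
            commands.append(
                build_serve_command(
                    spec["model_id"], tp_size, quantization=spec["quantization"]
                )
            )
    return commands
```

Each returned list can be handed to `subprocess.run` on the benchmark host, one server at a time.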

We ran four workloads because different workloads stress different parts of the engine – and the chat vs. batch vs. RAG split is where the dense vs. MoE difference shows up clearly.

Chat – we send 2,000 real conversational prompts in unpredictable bursts, capping at 100 concurrent users. Simulates a customer-facing chatbot, and measures KV cache pressure, the dominant constraint when many users share one server.
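A minimal sketch of how a bursty, concurrency-capped load like this could be generated. Everything here is illustrative rather than our actual harness: `fake_request` stands in for a real HTTP call to the inference server, and modeling "unpredictable bursts" as exponentially distributed gaps between arrivals is an assumption.

```python
import asyncio
import random


async def fake_request(prompt):
    """Hypothetical stand-in for a real call to the inference server."""
    await asyncio.sleep(random.uniform(0.001, 0.005))
    return f"response to {prompt}"


async def run_bursty_load(prompts, max_concurrency=100, mean_gap_s=0.0005, seed=0):
    """Send prompts with random gaps, never exceeding max_concurrency in flight."""
    rng = random.Random(seed)
    semaphore = asyncio.Semaphore(max_concurrency)
    in_flight = 0
    peak = 0

    async def worker(prompt):
        nonlocal in_flight, peak
        async with semaphore:  # blocks while the cap is reached
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_request(prompt)
            finally:
                in_flight -= 1

    tasks = []
    for prompt in prompts:
        tasks.append(asyncio.create_task(worker(prompt)))
        # Exponential gaps produce clusters and lulls instead of a steady drip.
        await asyncio.sleep(rng.expovariate(1.0 / mean_gap_s))

    responses = await asyncio.gather(*tasks)
    return responses, peak
```

Calling `asyncio.run(run_bursty_load(prompts))` returns the responses plus the peak number of simultaneous requests, which is a quick check that the cap held.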

Stress – we dump 1,000 random-length requests at the server with no rate limit. Simulates offline batch processing – summarizing thousands of articles, generating synthetic data overnight. It measures the absolute throughput ceiling.

Medium context (32k) – steady requests with 32,000-token prompts. Simulates standard RAG workflows – querying internal documentation, scanning a chunked codebase – and measures prefill computation as the context window fills.

Long context (100k) – ten sequential requests with 100,000-token prompts. Simulates analyzing a large codebase and measures time-to-first-token under the heaviest prefill load.
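The two metrics reported throughout, mean time-to-first-token and output tokens per second, can be derived from per-request timestamps. The sketch below shows one plausible way to compute them; the `RequestTrace` record and the choice to measure throughput over the wall-clock span of the whole run are assumptions, not a description of any benchmark tool's internals.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestTrace:
    """Hypothetical per-request record; all times are in seconds."""
    sent_at: float          # when the request was issued
    first_token_at: float   # when the first output token arrived
    finished_at: float      # when the last output token arrived
    output_tokens: int      # number of tokens generated


def mean_ttft_ms(traces):
    """Average delay between sending a request and its first token, in ms."""
    return mean((t.first_token_at - t.sent_at) * 1000.0 for t in traces)


def output_tokens_per_second(traces):
    """Total generated tokens divided by the wall-clock span of the run."""
    start = min(t.sent_at for t in traces)
    end = max(t.finished_at for t in traces)
    return sum(t.output_tokens for t in traces) / (end - start)
```

Measuring throughput over the whole run, rather than averaging per-request rates, is what lets overlapping requests count toward the server's aggregate output.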

Chat and batch don’t need the same model

Gemma 4 wins on chat and KV cache pressure. GPT-OSS 120B wins on stress throughput and RAG prefill. Architecture explains the split, not size – and that distinction is what makes Gemma 4 relevant for self-hosting decisions.

Chat: Gemma 4 keeps up

At TP=8, Gemma 4 hits 3,958 output tok/s versus GPT-OSS 120B’s 3,443. Mean Time-To-First-Token is 210 ms versus 504 ms. Gemma 4 also runs at TP=2 with no OOM and still pushes 3,113 tok/s. 

GPT-OSS 120B OOMed at TP=2 in our 32k-context, high-concurrency configuration. vLLM’s recipe supports 2x H100 at lower context, but the tests we ran to simulate a production environment tipped it over.

Stress: GPT-OSS 120B’s MoE wins 

At TP=8 on a flood of 1,000 random requests, GPT-OSS 120B pushes 12,972 output tok/s versus Gemma 4’s 5,118 – about 2.5x faster.

The MoE only activates ~5B of its 120B parameters per token, so its effective compute per token is much lower than its parameter count suggests. For batch jobs, that matters.

RAG and long context: the prefill cliff

At 32k-token prompts, GPT-OSS 120B emits the first token in 1,437 ms. Gemma 4 31B takes 16,069 ms – 11x slower. The gap widens at 100k: GPT-OSS 120B answers in 2,997 ms (under three seconds) while Gemma 4 31B sits at 36,983 ms (about 37 seconds) before producing anything.

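This is the dense-vs-MoE prefill bottleneck. Both architectures run attention over the whole prompt, but a dense model also pushes every prompt token through all of its parameters. An MoE model routes each token through only a small subset of experts, so it sails through prefill.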
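A back-of-envelope estimate makes the compute gap concrete. The sketch below uses the common rule of thumb that a forward pass costs about 2 FLOPs per active parameter per token. That rule is an approximation we are assuming here: it ignores attention cost, quantization format, and kernel efficiency, so treat the result as an order-of-magnitude guide only.

```python
# Active parameter counts taken from the table earlier in the article.
GEMMA_4_31B_ACTIVE = 30.7e9   # dense: every parameter is active
GPT_OSS_120B_ACTIVE = 5.1e9   # MoE: only the routed experts are active


def forward_flops_per_token(active_params):
    """Rule-of-thumb forward-pass cost: ~2 FLOPs per active parameter."""
    return 2.0 * active_params


def prefill_flops(active_params, prompt_tokens):
    """Approximate compute needed to process a whole prompt once."""
    return forward_flops_per_token(active_params) * prompt_tokens


# How much more parameter compute the dense model spends per prompt token.
compute_ratio = GEMMA_4_31B_ACTIVE / GPT_OSS_120B_ACTIVE  # about 6x

gemma_32k = prefill_flops(GEMMA_4_31B_ACTIVE, 32_000)
gpt_oss_32k = prefill_flops(GPT_OSS_120B_ACTIVE, 32_000)
```

By this estimate the dense model does roughly six times the parameter compute per prompt token. The measured TTFT gap is closer to 11x, so active parameter count explains much of the difference but not all of it.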

Hardware cost

Gemma 4 31B serves chat well at TP=2 – two H100s. Our production-realistic GPT-OSS 120B configuration needed at least four.

On AWS p5.48xlarge on-demand list pricing ($55.04/hr, about $6.88 per H100-hour), that is roughly $13.76/hr of GPU footprint for Gemma 4 31B versus $27.52/hr for GPT-OSS 120B. Reserved, Savings Plans, and Spot all run cheaper, but the 2x ratio holds.
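The arithmetic behind those figures, as a small sketch. The per-GPU price is a pro-rata share of the instance price, not something AWS bills separately, and the cost-per-million-tokens helper is an illustrative extension of ours, not a figure from the benchmark runs.

```python
INSTANCE_PRICE_PER_HOUR = 55.04  # p5.48xlarge on-demand list price quoted above
GPUS_PER_INSTANCE = 8


def gpu_footprint_cost(gpus_used,
                       instance_price=INSTANCE_PRICE_PER_HOUR,
                       gpus_per_instance=GPUS_PER_INSTANCE):
    """Pro-rata hourly cost of the GPUs a deployment occupies."""
    return instance_price / gpus_per_instance * gpus_used


def cost_per_million_output_tokens(cost_per_hour, tokens_per_second):
    """Hourly cost divided by hourly output, scaled to one million tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return cost_per_hour / tokens_per_hour * 1_000_000


gemma_cost = gpu_footprint_cost(2)     # two H100s for Gemma 4 31B
gpt_oss_cost = gpu_footprint_cost(4)   # four H100s for GPT-OSS 120B

# Gemma 4 31B chat throughput at TP=2 from our runs: 3,113 output tok/s.
gemma_chat_cost_per_m = cost_per_million_output_tokens(gemma_cost, 3113)
```

At sustained full load, Gemma 4 31B on two H100s works out to roughly $1.23 per million output tokens on chat. Real utilization is lower, so actual per-token cost will be higher.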

One other finding worth noting: doubling GPUs does not double throughput.

Gemma 4 stress throughput nearly doubles from TP=2 to TP=4 (1,721 to 3,280 tok/s), but TP=4 to TP=8 adds only 56% (3,280 to 5,118). GPT-OSS 120B is flatter still: TP=4 to TP=8 yields just 15% (11,270 to 12,972 tok/s). Past four H100s, both models hit communication overhead faster than they hit compute limits.
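To put a number on the diminishing returns, each doubling can be compared against the ideal 2x. The helper below is a small sketch using the stress throughput figures quoted above; `scaling_report` is our own illustrative name.

```python
def scaling_report(throughput_by_tp):
    """For each step up in TP size, return (from, to, speedup, efficiency).

    Efficiency is the observed speedup divided by the ideal speedup,
    where ideal means throughput grows in proportion to GPU count.
    """
    sizes = sorted(throughput_by_tp)
    rows = []
    for small, large in zip(sizes, sizes[1:]):
        speedup = throughput_by_tp[large] / throughput_by_tp[small]
        ideal = large / small
        rows.append((small, large, speedup, speedup / ideal))
    return rows


# Stress throughput in output tok/s, keyed by tensor-parallel size.
gemma_stress = {2: 1721, 4: 3280, 8: 5118}
gpt_oss_stress = {4: 11270, 8: 12972}

gemma_rows = scaling_report(gemma_stress)
gpt_oss_rows = scaling_report(gpt_oss_stress)
```

For Gemma 4 31B this gives about 95% scaling efficiency from TP=2 to TP=4 and about 78% from TP=4 to TP=8. For GPT-OSS 120B the TP=4 to TP=8 step comes out near 58%.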

TABLE 2

Benchmark (TP=8)                | Gemma 4 31B | GPT-OSS 120B
Chat – output tok/s             | 3,958       | 3,443
Chat – mean TTFT (ms)           | 210         | 504
Stress – output tok/s           | 5,118       | 12,972
Medium context – mean TTFT (ms) | 16,069      | 1,437

The RAG answer: use the MoE variant

The 31B dense model’s prefill bottleneck makes it the wrong tool for heavy RAG, codebase, or document-analysis workloads. Sixteen seconds to the first token on a 32k prompt is fine for a single user, but painful at scale.

The answer is Gemma 4 26B A4B, the MoE variant. It uses the same sparse activation mechanism that makes GPT-OSS 120B fast at prefill – 3.8B active and 25.2B total. Google’s launch materials position it as 26B-class intelligence at roughly 4B-class inference cost. Published vLLM and Ollama runs on consumer Blackwell hardware confirm the throughput profile holds in practice.

For RAG-heavy self-hosting, 26B A4B is the right Gemma 4 variant.

Which variant fits which workload

Gemma 4 is a meaningful step forward for the typical self-hosting setup. The dense 31B delivers competitive accuracy at TP=2 and still serves a chatbot at over 3,000 tok/s – a fairly accurate open-weight model on roughly $14/hr of GPU footprint, versus $28+/hr for our GPT-OSS 120B configuration.

So, pick based on what you’re actually running.

  • Use the dense Gemma 4 31B for chat and general assistance. 
  • Use the MoE Gemma 4 26B A4B for RAG, document analysis, and long-context workloads. 
  • Reach for GPT-OSS 120B only if you have at least four H100s and need the highest batch throughput.

Open-weight models keep compressing the hardware vs quality frontier. Gemma 4 is the latest breakthrough on that curve – and for most chat-oriented production deployments, the case for self-hosting just got meaningfully easier to make.

Evaluating whether to self-host AI in your enterprise – or figuring out which model fits your workload and infrastructure? Infinum’s AI team helps enterprises size, deploy, and operate production AI systems. Talk to us