Until recently, self-hosting a competitive open-weight model meant committing to significant GPU investments.
Running GPT-OSS 120B in production – at realistic context lengths and with headroom for concurrent users – means at least four H100s. The frontier models like DeepSeek V4-Pro and Qwen 3.5-397B need even more.
Gemma 4 shifts the math.
We benchmarked the dense 31B against GPT-OSS 120B on identical hardware. At realistic workloads, Gemma 4 holds its own on chat at half the GPU count – roughly $14/hr versus $28/hr of GPU footprint.
The tradeoff is real and workload-specific, but for many production deployments, it’s a tradeoff worth making.
What Gemma 4 actually is
Google released Gemma 4 in April 2026 under Apache 2.0. We focused on two variants:
- The dense Gemma 4 31B (30.7B parameters, 256k context)
- The MoE Gemma 4 26B A4B (25.2B total, 3.8B active per token, 256k context).
How does Gemma 4 stack up against its peers on accuracy? Here’s the picture from openrouter.ai benchmarks:
TABLE 1
| Model | Active Params (Total Params) | Intelligence Index | Coding Index | Agentic Index |
| Qwen3.5-397B-A17B | 17B (397B) | 45.0 | 41.3 | 55.8 |
| Nemotron 3 Super | 12B (120B) | 43.8 | 39.5 | 49.2 |
| Gemma 4 31B | 30.7B (30.7B) | 39.2 | 38.7 | 40.9 |
| GPT-OSS 120B | 5.1B (120B) | 33.3 | 28.6 | 37.9 |
| Gemma 4 26B A4B | 3.8B (26B) | 27.1 | 29.1 | 28.9 |
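Gemma 4 31B holds its own against open-weight models with roughly four to thirteen times its total parameter count. It beats GPT-OSS 120B on all three indices, and trails Nemotron 3 Super and Qwen3.5-397B-A17B by under six points on intelligence and under three on coding, with a wider gap on the agentic index.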
What we tested and why
We benchmarked Gemma 4 31B against GPT-OSS 120B on identical hardware to isolate the architectural difference. The setup:
- Hardware – AWS p5.48xlarge (8x NVIDIA H100 80GB)
- Runtime – vLLM. Gemma 4 31B at fp8, GPT-OSS 120B at mxfp4.
- Tensor parallelism (the number of GPUs the model is sharded across) – TP=2, TP=4, and TP=8 for Gemma 4 31B. GPT-OSS 120B runs out of memory (OOMs) at TP=2 in this configuration, so we ran it at TP=4 and TP=8.
- Context length – 32k for the main runs, 128k for additional long-context probes.
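As a rough sketch of how a sweep like the one above could be launched, the snippet below builds `vllm serve` command lines for each model and TP size. It is not our actual harness. The model identifiers are placeholders, and we assume here that the mxfp4 format is picked up from the GPT-OSS checkpoint itself, so no quantization flag is passed for it.

```python
from typing import Dict, List, Optional, Tuple


def vllm_serve_args(model: str, tp: int, max_len: int,
                    quantization: Optional[str] = None) -> List[str]:
    """Build an argv list for `vllm serve` with the given parallelism and context."""
    args = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp),
        "--max-model-len", str(max_len),
    ]
    if quantization is not None:
        args += ["--quantization", quantization]
    return args


# Placeholder checkpoint names (hypothetical): substitute the real model IDs.
# Each entry maps model -> (TP sizes to test, quantization flag or None).
SWEEP: Dict[str, Tuple[Tuple[int, ...], Optional[str]]] = {
    "<gemma-4-31b-checkpoint>": ((2, 4, 8), "fp8"),
    "<gpt-oss-120b-checkpoint>": ((4, 8), None),  # assumption: mxfp4 comes from the checkpoint
}


def sweep_commands(max_len: int = 32_768) -> List[List[str]]:
    """One command per (model, TP) pair at the given context length."""
    return [
        vllm_serve_args(model, tp, max_len, quant)
        for model, (tps, quant) in SWEEP.items()
        for tp in tps
    ]
```

Each returned list can be passed to `subprocess.run` one at a time, with the benchmark client pointed at the server between launches.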
We ran four workloads because different workloads stress different parts of the engine – and the chat vs. batch vs. RAG split is where the dense vs. MoE difference shows up clearly.
Chat – we send 2,000 real conversational prompts in unpredictable bursts, capping at 100 concurrent users. Simulates a customer-facing chatbot, and measures KV cache pressure, the dominant constraint when many users share one server.
Stress – we dump 1,000 random-length requests at the server with no rate limit. Simulates offline batch processing – summarizing thousands of articles, generating synthetic data overnight. It measures the absolute throughput ceiling.
Medium context (32k) – steady requests with 32,000-token prompts. Simulates standard RAG workflows – querying internal documentation, scanning a chunked codebase – and measures prefill computation as the context window fills.
Long context (100k) – ten sequential requests with 100,000-token prompts. Simulates analyzing a large codebase and measures time-to-first-token under the heaviest prefill load.
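To make the concurrency cap in the chat workload concrete, here is a minimal sketch of a client that never has more than 100 requests in flight. It only illustrates the cap, not the bursty arrival pattern, and `fake_request` is a hypothetical stand-in for the real HTTP call to the inference server.

```python
import asyncio
import random
from typing import List, Sequence, Tuple


async def fake_request(prompt: str) -> str:
    """Hypothetical stand-in for a request to the inference server."""
    await asyncio.sleep(random.uniform(0.001, 0.005))
    return prompt.upper()


async def run_capped(prompts: Sequence[str],
                     max_concurrency: int = 100) -> Tuple[List[str], int]:
    """Send all prompts, keeping at most `max_concurrency` requests in flight."""
    sem = asyncio.Semaphore(max_concurrency)
    in_flight = 0
    peak = 0

    async def worker(prompt: str) -> str:
        nonlocal in_flight, peak
        async with sem:  # blocks while the cap is reached
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_request(prompt)
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(worker(p) for p in prompts))
    return list(results), peak


def run_chat_workload(prompts: Sequence[str],
                      max_concurrency: int = 100) -> Tuple[List[str], int]:
    """Synchronous entry point; returns responses and the observed peak concurrency."""
    return asyncio.run(run_capped(prompts, max_concurrency))
```

The stress workload is the same loop with the semaphore removed, which is why it measures the throughput ceiling while the chat workload measures behavior under a bounded number of users.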
Chat and batch don’t need the same model
Gemma 4 wins on chat and KV cache pressure. GPT-OSS 120B wins on stress throughput and RAG prefill. Architecture explains the split, not size – and that distinction is what makes Gemma 4 relevant for self-hosting decisions.
Chat: Gemma 4 keeps up
At TP=8, Gemma 4 hits 3,958 output tok/s versus GPT-OSS 120B’s 3,443. Mean Time-To-First-Token is 210 ms versus 504 ms. Gemma 4 also runs at TP=2 with no OOM and still pushes 3,113 tok/s.
GPT-OSS 120B OOMed at TP=2 in our 32k-context, high-concurrency configuration. vLLM’s recipe supports 2x H100 at lower context, but the tests we ran to simulate a production environment tipped it over.
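A back-of-the-envelope estimate shows why context length and concurrency together can tip a configuration over. The sketch below uses the standard KV cache size formula (keys plus values, per layer, per KV head). The layer and head counts are hypothetical illustration values, not the real Gemma 4 or GPT-OSS configuration, and the estimate ignores savings from sliding-window attention or KV quantization.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor of 2: one key vector and one value vector per layer per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_cache_gib(tokens_in_flight: int, **dims: int) -> float:
    """Total KV cache in GiB for the given number of cached tokens."""
    return tokens_in_flight * kv_cache_bytes_per_token(**dims) / 2**30


# Hypothetical model dimensions, chosen only to illustrate the arithmetic.
HYPOTHETICAL_DIMS = dict(num_layers=60, num_kv_heads=8, head_dim=128)

# One user holding a full 32k context, and the worst case of 100 such users.
per_user_gib = kv_cache_gib(32_768, **HYPOTHETICAL_DIMS)
worst_case_gib = kv_cache_gib(100 * 32_768, **HYPOTHETICAL_DIMS)
```

Real chat prompts are far shorter than the full window, so the worst case is an upper bound. The point is that the cache grows linearly with tokens in flight, while the memory left after loading weights at a given TP size is fixed.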
Stress: GPT-OSS 120B’s MoE wins
At TP=8 on a flood of 1,000 random requests, GPT-OSS 120B pushes 12,972 output tok/s versus Gemma 4’s 5,118 – about 2.5x faster.
The MoE only activates ~5B of its 120B parameters per token, so its effective compute per token is much lower than its parameter count suggests. For batch jobs, that matters.
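The active-versus-total gap can be put in numbers. The sketch below uses the parameter counts quoted in this article and the common rule of thumb of about 2 FLOPs per active parameter per token for a forward pass. That rule is an approximation we are assuming here; it ignores attention cost and memory bandwidth.

```python
# (active params, total params) in billions, as quoted in this article.
MODELS = {
    "GPT-OSS 120B": (5.1, 120.0),
    "Gemma 4 26B A4B": (3.8, 25.2),
    "Gemma 4 31B": (30.7, 30.7),
}


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the model's weights touched for each token."""
    return active_b / total_b


def approx_gflops_per_token(active_b: float) -> float:
    """Rule-of-thumb forward cost: ~2 FLOPs per active parameter (assumption)."""
    return 2.0 * active_b


def compute_ratio(model_a: str, model_b: str) -> float:
    """How many times more per-token compute model_a needs than model_b."""
    return approx_gflops_per_token(MODELS[model_a][0]) / approx_gflops_per_token(MODELS[model_b][0])


for name, (active, total) in MODELS.items():
    print(f"{name}: {active_fraction(active, total):.1%} active, "
          f"~{approx_gflops_per_token(active):.1f} GFLOPs/token")
```

By this estimate the dense 31B does about six times the per-token compute of GPT-OSS 120B. The measured stress gap is smaller, about 2.5x, which is a reminder that the ratio is a rough guide and not a throughput prediction.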
RAG and long context: the prefill cliff
At 32k-token prompts, GPT-OSS 120B emits the first token in 1,437 ms. Gemma 4 31B takes 16,069 ms – 11x slower. The gap widens at 100k: GPT-OSS 120B answers in 2,997 ms (under three seconds) while Gemma 4 31B sits at 36,983 ms (about 37 seconds) before producing anything.
This is the dense-vs-MoE prefill bottleneck. A dense model pushes every prompt token through all of its weights. An MoE model routes each token through only a few experts, so it does far less compute per prompt token and sails through prefill.
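The slowdown figures follow directly from the measured TTFT values. The sketch below recomputes them and also derives an implied prefill rate, under the simplifying assumption that time-to-first-token is all prefill (no queueing) and that prompts are exactly 32,000 and 100,000 tokens.

```python
# Mean TTFT in milliseconds, from the measurements quoted in this article.
TTFT_MS = {
    32_000: {"Gemma 4 31B": 16_069, "GPT-OSS 120B": 1_437},
    100_000: {"Gemma 4 31B": 36_983, "GPT-OSS 120B": 2_997},
}


def slowdown(prompt_tokens: int) -> float:
    """How many times longer Gemma 4 31B takes to emit the first token."""
    row = TTFT_MS[prompt_tokens]
    return row["Gemma 4 31B"] / row["GPT-OSS 120B"]


def implied_prefill_tok_per_s(prompt_tokens: int, model: str) -> float:
    """Prompt tokens per second, assuming TTFT is entirely prefill time."""
    return prompt_tokens / (TTFT_MS[prompt_tokens][model] / 1000.0)


for tokens in TTFT_MS:
    print(f"{tokens:>7} tokens: Gemma 4 31B is {slowdown(tokens):.1f}x slower to first token")
```

Dividing prompt length by TTFT gives a quick way to estimate first-token latency for other prompt sizes, as long as the server is not queueing requests.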
Hardware cost
Gemma 4 31B serves chat well at TP=2 – two H100s. Our production-realistic GPT-OSS 120B configuration needed at least four.
On AWS p5.48xlarge on-demand list pricing ($55.04/hr, about $6.88 per H100-hour), that is roughly $13.76/hr of GPU footprint for Gemma 4 31B versus $27.52/hr for GPT-OSS 120B. Reserved, Savings Plans, and Spot all run cheaper, but the 2x ratio holds.
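The footprint numbers are a pro-rata share of the instance price. A minimal sketch of the arithmetic, assuming the on-demand list price quoted above; the instance is rented whole, so this is an accounting figure and not something you can buy two GPUs at a time:

```python
INSTANCE_PRICE_PER_HR = 55.04  # p5.48xlarge on-demand list price quoted above
GPUS_PER_INSTANCE = 8


def gpu_hour_price() -> float:
    """Pro-rata price of one H100 for one hour."""
    return INSTANCE_PRICE_PER_HR / GPUS_PER_INSTANCE


def footprint_cost_per_hr(gpus_used: int) -> float:
    """Hourly cost attributed to a deployment that occupies `gpus_used` GPUs."""
    return round(gpu_hour_price() * gpus_used, 2)


def footprint_cost_per_month(gpus_used: int, hours: int = 730) -> float:
    """Same figure over an average month (730 hours is an assumption)."""
    return round(gpu_hour_price() * gpus_used * hours, 2)
```

Swapping in a reserved or Spot price changes both figures by the same factor, which is why the 2x ratio between the two configurations is unaffected.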
One other finding worth noting: doubling GPUs does not double throughput.
Gemma 4 stress throughput nearly doubles from TP=2 to TP=4 (1,721 to 3,280 tok/s), but TP=4 to TP=8 adds only 56% (3,280 to 5,118). GPT-OSS 120B is flatter still: TP=4 to TP=8 yields just 15% (11,270 to 12,972 tok/s). Past four H100s, both models hit communication overhead faster than they hit compute limits.
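The scaling percentages can be reproduced from the stress throughput numbers. The sketch below computes the speedup for each doubling of GPUs and the scaling efficiency, defined here as speedup divided by the increase in GPU count, so 1.0 would be perfect linear scaling.

```python
from typing import Dict, List, Tuple

# Stress workload output tok/s by TP size, from the runs quoted in this article.
STRESS_TOK_S: Dict[str, Dict[int, int]] = {
    "Gemma 4 31B": {2: 1_721, 4: 3_280, 8: 5_118},
    "GPT-OSS 120B": {4: 11_270, 8: 12_972},
}


def scaling_steps(throughput: Dict[int, int]) -> List[Tuple[int, int, float, float]]:
    """Return (from_tp, to_tp, speedup, efficiency) for each consecutive TP pair."""
    tps = sorted(throughput)
    steps = []
    for lo, hi in zip(tps, tps[1:]):
        speedup = throughput[hi] / throughput[lo]
        efficiency = speedup / (hi / lo)  # 1.0 means throughput scaled with GPU count
        steps.append((lo, hi, speedup, efficiency))
    return steps


for model, tput in STRESS_TOK_S.items():
    for lo, hi, speedup, eff in scaling_steps(tput):
        print(f"{model}: TP={lo} -> TP={hi}: {speedup:.2f}x speedup, {eff:.0%} efficiency")
```

An efficiency well below 1.0 means the extra GPUs cost more per token than the ones already in use, which is worth checking before scaling a single replica instead of adding a second one.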
TABLE 2
| Benchmark (TP=8) | Gemma 4 31B | GPT-OSS 120B |
| Chat – output tok/s | 3,958 | 3,443 |
| Chat – mean TTFT (ms) | 210 | 504 |
| Stress – output tok/s | 5,118 | 12,972 |
| Medium context – mean TTFT (ms) | 16,069 | 1,437 |
The RAG answer: use the MoE variant
The 31B dense model’s prefill bottleneck makes it the wrong tool for heavy RAG, codebase or document analysis workloads. Sixteen seconds to the first token on a 32k prompt is fine for a single user, but painful at scale.
The answer is Gemma 4 26B A4B, the MoE variant. It uses the same sparse activation mechanism that makes GPT-OSS 120B fast at prefill: only 3.8B of its 25.2B parameters are active per token. Google’s launch materials position it as 26B-class intelligence at roughly 4B-class inference cost. Published vLLM and Ollama runs on consumer Blackwell hardware confirm the throughput profile holds in practice.
For RAG-heavy self-hosting, 26B A4B is the right Gemma 4 variant.
Which variant fits which workload
Gemma 4 is a meaningful step forward for the typical self-hosting setup. The dense 31B delivers competitive accuracy and, at TP=2, still serves a chatbot at over 3,000 tok/s on roughly $14/hr of GPU footprint, versus $28/hr or more for our GPT-OSS 120B configuration.
So, pick based on what you’re actually running.
- Use the dense Gemma 4 31B for chat and general assistance.
- Use the MoE Gemma 4 26B A4B for RAG, document analysis, and long-context workloads.
- Reach for GPT-OSS 120B only if you have at least four H100s and need the highest batch throughput.
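The guidance above can be written down as a small lookup, which is handy if you want to encode it in a deployment checklist. This is a sketch of our recommendations only. The `Workload` names are hypothetical, and falling back to the dense 31B for batch jobs on fewer than four H100s is an assumption, not something we benchmarked.

```python
from enum import Enum


class Workload(Enum):
    CHAT = "chat"
    RAG = "rag"
    LONG_CONTEXT = "long_context"
    BATCH = "batch"


def recommend_model(workload: Workload, h100_count: int) -> str:
    """Map a workload and GPU budget to the variant suggested in this article."""
    if workload in (Workload.RAG, Workload.LONG_CONTEXT):
        # Prefill-heavy work favors the MoE variant.
        return "Gemma 4 26B A4B"
    if workload is Workload.BATCH and h100_count >= 4:
        # GPT-OSS 120B needed at least four H100s in our configuration.
        return "GPT-OSS 120B"
    # Chat and general assistance; also the assumed fallback for small batch setups.
    return "Gemma 4 31B"
```

For example, `recommend_model(Workload.CHAT, 2)` returns the dense 31B, matching the two-H100 chat configuration described earlier.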
Open-weight models keep compressing the hardware vs quality frontier. Gemma 4 is the latest breakthrough on that curve – and for most chat-oriented production deployments, the case for self-hosting just got meaningfully easier to make.
Evaluating whether to self-host AI in your enterprise – or figuring out which model fits your workload and infrastructure? Infinum’s AI team helps enterprises size, deploy, and operate production AI systems. Talk to us.