GB10/DGX Spark reality check: Gemma4 MTP gets 75-80 tok/s, NVFP4 caps at 50, and a silent vLLM failover trap that cost me an afternoon
You're goddamn right I had Claude generate all this, but I did go through it all this afternoon (death to the emdash)
TL;DR: Spent today benchmarking local inference on a DGX Spark (GB10 Superchip, SM121, 128 GB unified memory). Three findings worth sharing:
- SM121 has NO native FP4 tensor cores. NVFP4 quants on this hardware run via Marlin software decompression to BF16, capping at ~50–52 tok/s regardless of model size. Native FP4 compute is GB200/GB300 (SM100-class) only. If you bought a Spark thinking "Blackwell = FP4 acceleration," you got a half-truth: FP8 is the right native format here.
- Gemma4 MTP needs vLLM PR #41745 (merged May 6). The `vllm/vllm-openai:gemma4-0505-cu130` image ships with two bugs in `gemma4_mtp.py`: `intermediate_size` was read from the top-level config (4096) instead of `text_config.intermediate_size` (8192), so the drafter MLP was half-sized. Plus `quant_config` got propagated from the FP8 target to the BF16 drafter Linear layers, causing a shape mismatch. Without the fix, MTP makes things slower (~20 tok/s vs the 35 tok/s baseline).
- vLLM tool calling silently fails over. If you serve Gemma4 without `--enable-auto-tool-choice --tool-call-parser gemma4`, any client sending `tool_choice: "auto"` gets HTTP 400. If you have a router with fallback (OpenClaw, LiteLLM, etc.), requests silently land on a different model. I shipped my "Gemma4 daily driver" for an hour before realizing every request was hitting Qwen. A quick check is shown right after this list.
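Here's a minimal probe for that failure mode, assuming the serve setup further down (port 11437, served model name `gemma4-fp8-mtp`); the `noop` tool is just a placeholder:

```bash
# Send a request with tool_choice "auto" and print only the HTTP status.
# 400 means the server was started without --enable-auto-tool-choice;
# a fallback router silently reroutes this to another model instead of erroring.
curl -s -o /dev/null -w "%{http_code}\n" \
  http://localhost:11437/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4-fp8-mtp",
    "messages": [{"role": "user", "content": "ping"}],
    "tools": [{"type": "function",
               "function": {"name": "noop",
                            "parameters": {"type": "object", "properties": {}}}}],
    "tool_choice": "auto"
  }'
```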
Real numbers on GB10
Single-stream, `/v1/chat/completions`, 512-token coding prompt:
| Model + Engine | Quant | tok/s |
|---|---|---|
| Gemma4 26B A4B + vLLM + gemma4_mtp (N=4) | FP8-Dynamic | 75.7 |
| Qwen3.6-35B-A3B + llama.cpp | MXFP4 | 63.7 |
| Gemma4 26B A4B + vLLM (no MTP) | NVFP4 | 50.0 |
| Gemma4 26B A4B + vLLM (no MTP) | FP8-Dynamic | ~35 |
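For reference, a single-stream number boils down to one non-streamed request, then completion tokens divided by wall time. A sketch, not the exact harness I used; assumes `jq` and `bc` are installed and the prompt stands in for the 512-token coding prompt:

```bash
# Single-stream throughput probe: completion_tokens / wall-clock seconds.
start=$(date +%s.%N)
resp=$(curl -s http://localhost:11437/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma4-fp8-mtp",
       "messages": [{"role": "user", "content": "Write an LRU cache in Python."}],
       "max_tokens": 1024}')
end=$(date +%s.%N)
tokens=$(echo "$resp" | jq '.usage.completion_tokens')
echo "scale=1; $tokens / ($end - $start)" | bc   # tok/s
```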
MTP acceptance rate is content-dependent: ~76% on code (clean structure), ~50% on prose (higher entropy). Per-position acceptance for code at N=4: 91% / 90% / 89% / 85%, each conditional on the previous position being accepted. The drafter is genuinely good.
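One place to read acceptance stats is vLLM's Prometheus `/metrics` endpoint; the exact counter names vary across versions, so grep rather than hardcode:

```bash
# vLLM exposes speculative-decoding counters on /metrics; names differ
# across versions, so just filter for anything spec-decode related.
curl -s http://localhost:11437/metrics | grep -i spec_decode
```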
`num_speculative_tokens` sweep on Gemma4 MTP
Same prompt, same model, varying spec budget:
| N | tok/s | Avg acceptance |
|---|---|---|
| 2 | 67.5 | 87% |
| 3 | 71.2 | 84% |
| 4 | 80.0 | 76% |
| 5 | 76.9 | 66% |
| 6 | 72.2 | 56% |
N=4 is the throughput optimum here. Below N=4: not enough accepted tokens per draft step. Above N=4: drafter forward-pass overhead exceeds the gain from extra positions.
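To reproduce the sweep, vary `num_speculative_tokens` in the speculative config and restart the server between runs. A bare sketch, with the readiness wait and benchmark call elided:

```bash
# Sweep the speculative token budget; rerun the tok/s probe above per N.
for n in 2 3 4 5 6; do
  vllm serve RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic \
    --port 11437 \
    --speculative-config "{\"method\":\"gemma4_mtp\",\"model\":\"google/gemma-4-26B-A4B-it-assistant\",\"num_speculative_tokens\":$n}" &
  SERVER_PID=$!
  # ...wait for /health, run the throughput probe, record the result...
  kill "$SERVER_PID"; wait "$SERVER_PID" 2>/dev/null
done
```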
Workaround until PR #41745 lands in an image
Until a nightly with the fix lands on Docker Hub, build a custom image that overlays the vLLM main-branch Python source on top of the existing inference container. The compiled .so kernels stay (they were built for SM121/CUDA 13.0); only the .py files get replaced.
```dockerfile
FROM vllm/vllm-openai:gemma4-0505-cu130

RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*

# Overlay vLLM main's Python sources onto the installed package.
# Compiled SM121/CUDA 13.0 kernels (.so) are untouched; only .py files change.
RUN SITE_PKG=$(python3 -c "import site; print(site.getsitepackages()[0])") \
    && git clone --depth=1 https://github.com/vllm-project/vllm.git /tmp/vllm-src \
    && cp -r /tmp/vllm-src/vllm/* "${SITE_PKG}/vllm/" \
    && rm -rf /tmp/vllm-src
```
Builds in ~5 seconds. No nvcc, no cmake, no 90-min compile.
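Build and spot-check it. The image tag is my own choice, and since the vllm-openai base image sets the API server as its entrypoint, the version check overrides it; a version string from main (rather than the pinned 0505 build) suggests the overlay took:

```bash
docker build -t vllm-gemma4-overlay .
# Override the server entrypoint to inspect the installed package directly.
docker run --rm --entrypoint python3 vllm-gemma4-overlay \
  -c "import vllm; print(vllm.__version__)"
```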
Working serve command on SM121
```bash
vllm serve RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic \
  --served-model-name gemma4-fp8-mtp \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.45 \
  --max-num-seqs 4 \
  --max-num-batched-tokens 8192 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4 \
  --port 11437 \
  --speculative-config '{"method":"gemma4_mtp","model":"google/gemma-4-26B-A4B-it-assistant","num_speculative_tokens":4}'
```
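Once it's up, a quick smoke test using vLLM's `/health` endpoint plus one short completion (assumes `jq`); rerunning the `tool_choice` probe from the TL;DR should now return 200 instead of 400:

```bash
# Block until the server reports healthy, then fire one short request.
until curl -sf http://localhost:11437/health >/dev/null; do sleep 2; done
curl -s http://localhost:11437/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma4-fp8-mtp",
       "messages": [{"role": "user", "content": "Say hi."}],
       "max_tokens": 16}' | jq '.choices[0].message.content'
```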
Honest limitation
I couldn't reproduce the community's reported 108 tok/s. The best I could pull was 80 tok/s with a bare-minimum config (no tool/reasoning parsers, 32k context, FP16 KV cache). With the full production feature set, ~75 tok/s.