Engineering notes · benchmarks

GB10/DGX Spark reality check: Gemma4 MTP gets 75-80 tok/s, NVFP4 caps at 50, and a silent vLLM failover trap that cost me an afternoon

You're goddamn right I had Claude generate all this - but I did go through it all this afternoon (death to the emdash)

TL;DR — Spent today benchmarking local inference on a DGX Spark (GB10 Superchip, SM121, 128GB unified). Three findings worth sharing:

  1. SM121 has NO native FP4 tensor cores. NVFP4 quants on this hardware run via Marlin software decompression to BF16, capping at ~50–52 tok/s regardless of model size. Native FP4 compute is GB200/GB300 (SM90a+) only. If you bought a Spark thinking "Blackwell = FP4 acceleration," you got a half-truth — FP8 is the right native format here.
  2. Gemma4 MTP needs vLLM PR #41745 (merged May 6). The vllm/vllm-openai:gemma4-0505-cu130 image ships with two bugs in gemma4_mtp.py: it reads intermediate_size from the top-level config (4096) instead of text_config.intermediate_size (8192), leaving the drafter MLP half-sized, and it propagates quant_config from the FP8 target to the drafter's BF16 Linear layers, causing a shape mismatch. Without the fix, MTP makes things slower (~20 tok/s vs the ~35 tok/s non-MTP baseline).
  3. vLLM tool calling silently fails over. If you serve Gemma4 without --enable-auto-tool-choice --tool-call-parser gemma4, any client sending tool_choice: "auto" gets HTTP 400. If you have a router with fallback (OpenClaw, LiteLLM, etc.), requests silently land on a different model. I shipped my "Gemma4 daily driver" for an hour before realizing every request was hitting Qwen.
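A cheap client-side guard against that failover trap: OpenAI-compatible servers echo the model that actually handled the request in the response body, so you can check it before trusting the output. This is a sketch of my own - the helper name and minimal dict shape are mine, not part of vLLM or any router API:

```python
def assert_served_model(response: dict, expected: str) -> str:
    """Raise if a chat completion was silently served by a fallback model.

    `response` is the parsed JSON body of an OpenAI-compatible
    /v1/chat/completions reply, which echoes the serving model in `model`.
    """
    served = response.get("model", "<missing>")
    if served != expected:
        raise RuntimeError(
            f"requested {expected!r} but reply came from {served!r} - "
            "check --enable-auto-tool-choice / --tool-call-parser on the backend"
        )
    return served

# Example with a stubbed reply body:
assert_served_model({"model": "gemma4-fp8-mtp"}, "gemma4-fp8-mtp")
```

Wiring this into the client path would have caught the Qwen fallback on the first request instead of an hour in.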

Real numbers on GB10

Single-stream, /v1/chat/completions, 512-token coding prompt:

Model + Engine                             Quant         tok/s
Gemma4 26B A4B + vLLM + gemma4_mtp (N=4)   FP8-Dynamic    75.7
Qwen3.6-35B-A3B + llama.cpp                MXFP4          63.7
Gemma4 26B A4B + vLLM (no MTP)             NVFP4          50.0
Gemma4 26B A4B + vLLM (no MTP)             FP8-Dynamic    ~35

MTP acceptance rate is content-dependent: ~76% on code (clean structure), ~50% on prose (higher entropy). Per-position conditional acceptance for code at N=4: 91% / 90% / 89% / 85%. The drafter is genuinely good.
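Those per-position numbers imply roughly four tokens per target forward pass. In chain-style speculative decoding, position k is only kept if positions 1..k-1 were accepted, and the verification pass always contributes one token of its own. A quick sanity check - this is the standard expectation formula, not vLLM internals:

```python
def expected_tokens_per_step(cond_accept):
    """Expected tokens emitted per target forward pass, given per-position
    conditional acceptance rates of a chain-style speculative drafter."""
    total, chain = 1.0, 1.0  # verification always yields at least one token
    for p in cond_accept:
        chain *= p           # P(position k accepted) = p1 * p2 * ... * pk
        total += chain
    return total

# Code-content rates from the measurements above:
expected_tokens_per_step([0.91, 0.90, 0.89, 0.85])  # ~4.08 tokens/step
```

~4.08 tokens per target pass against a ~35 tok/s non-MTP baseline is consistent with the observed 75-80 tok/s once drafter overhead is subtracted.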


num_speculative_tokens sweep on Gemma4 MTP

Same prompt, same model, varying spec budget:

N   tok/s   Avg acceptance
2   67.5    87%
3   71.2    84%
4   80.0    76%
5   76.9    66%
6   72.2    56%

N=4 is the throughput optimum here. Below N=4: not enough accepted tokens per draft step. Above N=4: drafter forward-pass overhead exceeds the gain from extra positions.
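That tradeoff can be sketched with a toy cost model: expected tokens per step grow sublinearly in N (acceptance compounds), while drafter cost grows linearly. The per-position drafter cost below is an assumed number, not a measurement, and approximating the acceptance chain with the sweep's per-row average is rough - treat this as the shape of the curve, not an exact ranking:

```python
def throughput_proxy(n, avg_accept, drafter_cost=0.05):
    """Toy speculative-decoding throughput model. The target forward pass
    costs 1.0; each drafted position costs `drafter_cost` (assumed).
    Approximates P(position k accepted) as avg_accept**k."""
    expected_tokens = 1.0 + sum(avg_accept ** k for k in range(1, n + 1))
    step_cost = 1.0 + drafter_cost * n
    return expected_tokens / step_cost

throughput_proxy(4, 0.76)  # tokens per unit time, in target-forward-pass units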


Workaround for the PR #41745 image

Until a nightly with the fix lands on Docker Hub, build a custom image that overlays vLLM main Python source on top of the existing inference container. The compiled .so kernels stay (they were built for SM121/CUDA 13.0); only the .py files get replaced.

FROM vllm/vllm-openai:gemma4-0505-cu130

RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*

RUN SITE_PKG=$(python3 -c "import site; print(site.getsitepackages()[0])") \
 && git clone --depth=1 https://github.com/vllm-project/vllm.git /tmp/vllm-src \
 && cp -r /tmp/vllm-src/vllm/* "${SITE_PKG}/vllm/" \
 && rm -rf /tmp/vllm-src

Builds in ~5 seconds. No nvcc, no cmake, no 90-min compile.


Working serve command on SM121

vllm serve RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic \
 --served-model-name gemma4-fp8-mtp \
 --max-model-len 65536 \
 --gpu-memory-utilization 0.45 \
 --max-num-seqs 4 \
 --max-num-batched-tokens 8192 \
 --enable-auto-tool-choice \
 --tool-call-parser gemma4 \
 --reasoning-parser gemma4 \
 --port 11437 \
 --speculative-config '{"method":"gemma4_mtp","model":"google/gemma-4-26B-A4B-it-assistant","num_speculative_tokens":4}'
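Why --gpu-memory-utilization 0.45 on a 128 GB box? The memory is unified, so the CPU, OS, and any other resident models draw from the same pool; 0.45 leaves headroom for a second engine. Back-of-envelope budget - the 26 GB weight figure assumes ~1 byte/param for FP8 and ignores the drafter and activation overhead:

```python
total_gb = 128                     # GB10 unified memory
vllm_share_gb = 0.45 * total_gb    # --gpu-memory-utilization 0.45 -> 57.6 GB
weights_gb = 26                    # ~26B params at ~1 byte/param FP8 (assumed)
kv_budget_gb = vllm_share_gb - weights_gb  # ~31.6 GB left for KV cache
```

Whether ~31.6 GB covers 65536 context at 4 concurrent sequences depends on the model's KV layout, but that headroom is what the --max-model-len and --max-num-seqs flags above are carving out.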

Honest limitation

I couldn't reproduce the community's reported 108 tok/s. Best I could pull was 80 tok/s with the bare-minimum config (no tool/reasoning parsers, 32k context, FP16 KV). With the production feature set, ~75 tok/s.