Gemma4 as a Builder: The "Think/Build" Pattern & Hardware Bottlenecks

Tyler Merritt · May 17, 2026

The dream was simple: a local Sonnet.

Not a demo. Something that could follow a multi-step plan, catch its own mistakes, and execute at the level where you weren't constantly looking over its shoulder. Claude handles the architecture, generates the dense IMPLEMENTATION.md, hands it off. Gemma builds. You watch things get done.

The scaffolding almost worked. I ported the agentic skills from ~/.claude/skills into OpenCode and resolved a parser mismatch where Gemma4's raw thinking output wasn't triggering tool calls: the model needed to see its own internal monologue as a precursor to action before it could reliably chain steps. That part held. The harness held.

Then we actually tried to build something.

The aroha_gold project was not a demo — Prisma-backed database, Next.js Server Actions, Stripe integration, full Vitest and Playwright coverage. The kind of codebase where you can't fake it. Eight hours to complete the core logic. That sounds reasonable until you account for the error tax.

The failures weren't usually wrong logic. They were structural clobbering. The model would decide the safest way to ensure a function existed was to re-paste the entire block, leading to duplication that cascaded. Or it would get lost mid-file and drop a closing brace, then spend several subsequent turns slowly working backward through the TypeScript errors it had just created. It would lint the file. It would look at the output. And then it would miss the double curly braces, the stray >> in a template literal, the line break that broke the formatter.

Not because it wasn't checking. It was. It just couldn't see what it was looking at.
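One way to make that class of mistake harder to miss is to take the reading out of the model's hands and run a deterministic structural check after every edit. The sketch below is a hypothetical guard, not something from the aroha_gold harness. The function names are invented for illustration, and the scan is deliberately naive: it does not skip strings, comments, or template literals, so a real TypeScript file could trigger false positives. It flags unmatched brackets and function names that are declared more than once.

```python
import re
from collections import Counter

# Hypothetical post-edit guard. All names here are illustrative assumptions.
CLOSERS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(CLOSERS.values())

# Matches plain `function name(` declarations only, not arrow functions or methods.
FUNC_DECL = re.compile(
    r"^\s*(?:export\s+)?(?:async\s+)?function\s+([A-Za-z_$][\w$]*)",
    re.MULTILINE,
)


def unbalanced_delimiters(source: str) -> list[str]:
    """Report unmatched (), [] and {} with the line they appear on.

    Naive on purpose: string and comment contents are scanned too.
    """
    problems = []
    stack = []  # (opening character, line number)
    for line_no, line in enumerate(source.splitlines(), start=1):
        for ch in line:
            if ch in OPENERS:
                stack.append((ch, line_no))
            elif ch in CLOSERS:
                if stack and stack[-1][0] == CLOSERS[ch]:
                    stack.pop()
                else:
                    problems.append(f"line {line_no}: unexpected '{ch}'")
    problems.extend(f"line {n}: unclosed '{c}'" for c, n in stack)
    return problems


def duplicated_functions(source: str) -> list[str]:
    """Return function names that are declared more than once."""
    counts = Counter(FUNC_DECL.findall(source))
    return sorted(name for name, n in counts.items() if n > 1)


if __name__ == "__main__":
    clobbered = "function a() {\n  return 1;\n}\nfunction a() {\n  return 2;\n"
    print(duplicated_functions(clobbered))
    print(unbalanced_delimiters(clobbered))
```

A check like this would not replace the TypeScript compiler or the formatter. Its only job is to turn "did the model notice?" into a pass or fail that does not depend on the model reading its own output.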

Running the 31b model made this worse by adding time to every iteration. It ran at about 10 tok/s in practice, which is like watching paint dry. I didn't want to open VS Code and hand-edit the files, because that would have felt like going back in time a decade. But I also needed things to move faster than they did. In a coding workflow, the model has to feel like it's typing: present, live, keeping pace. The 26b at 75+ tok/s with MTP (multi-token prediction) felt that way. The 31b felt like watching someone spell-check a document out loud, one word at a time.
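To put rough numbers on that feeling, here is a back-of-the-envelope calculation. The two throughput figures come from the runs above, with 75 tok/s taken as the lower bound for the 26b. The 1,500-token edit size is an assumption chosen for illustration, not a measurement, and the estimate ignores prompt processing and time to first token.

```python
def seconds_to_generate(tokens: int, tok_per_s: float) -> float:
    """Pure decode time for a response of `tokens` tokens at a steady rate."""
    return tokens / tok_per_s


# Assumed size of one re-emitted file. Illustrative, not measured.
EDIT_TOKENS = 1500

# Throughput figures quoted in the post (75 is the lower bound of "75+").
RATES = {"31b": 10.0, "26b with MTP": 75.0}

if __name__ == "__main__":
    for name, rate in RATES.items():
        print(f"{name}: {seconds_to_generate(EDIT_TOKENS, rate):.0f} s per edit")
```

Under those assumptions a single full-file rewrite takes about 150 seconds on the 31b and about 20 seconds on the 26b. Every clobber-and-repair turn pays that cost again, which is why the slower model made the error tax so much harder to live with.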

The Think/Build pattern is sound in theory. Claude reasons, Gemma executes. The problem is the gap between execution and verification. The move that looks thorough and the move that's actually right turn out to be different moves.

The dream was a local Sonnet. What I got was a local builder that looks like it's checking its work. Those aren't the same thing.

Related: GB10/DGX Spark reality check: Gemma4 MTP, NVFP4 caps, and a silent vLLM failover trap — the hardware benchmarks that established the baseline for this workflow.