Essay · 13 min read

The Kernel I Wrote Without Knowing CUDA

Tyler Merritt · May 29, 2026

I don't know CUDA. I want to say that plainly, without apology or preamble. I have never written a GPU kernel in my life. I cannot explain from first principles what a UMMA instruction does inside a Blackwell chip, or describe the register layout of a streaming multiprocessor, or tell you how a collective operation pipelines tiles between memory tiers. I am 25 years into a technology career and I am, on this particular subject, a complete beginner.

Last week I filed a pull request to the FlashInfer library — an upstream open-source project used by vLLM and other major inference frameworks — that adds a native OCP-MX-FP8 grouped MoE kernel for SM120/121 GPUs. The kernel compiles. It loads. It passes 6/6 correctness tests against a BF16 reference. It makes Gemma4 answer "capital of France?" with "Paris." It runs 1.28 to 1.34 times faster than the incumbent at high concurrency. The pull request is open at flashinfer-ai/flashinfer#3463.

I still cannot fully explain how the kernel works at the instruction level. That was the point of the experiment.

I own a DGX Spark. That's NVIDIA's GB10 system: 128 gigabytes of unified LPDDR5X memory, a Blackwell GPU running at compute capability SM_121, 294 teraflops of FP8 tensor core throughput on paper. It is a workstation built specifically for the kind of local inference work I run — large language models, on real hardware, fast.

The model I care about right now is Gemma4-26B-A4B, running in MXFP8W8A8 quantization. The short version: 26 billion parameters stored as 8-bit floating point values in the OCP-MX format, with a 32-element block scale that preserves accuracy. Only about 4 billion of those parameters activate per forward pass because of the mixture-of-experts architecture — the model routes each token through a small subset of "expert" layers. It is efficient by design. On hardware that has dedicated tensor cores for exactly this 8-bit format, it should be very fast.

It wasn't. Not as fast as it should have been.

The MoE layers — the whole point of the architecture, the reason the model fits on consumer hardware — were falling back to MARLIN. That's a dequantization path: it takes the 8-bit weights and expands them back to 16-bit before doing the math. You still load compact 8-bit values from memory, so you get the bandwidth benefit of quantization. But the actual arithmetic runs in BF16, and the FP8 tensor cores sit idle. It is like owning a car with a high-performance engine and having the transmission permanently locked in second gear because the software never told it the other gears exist.

The reason was technically simple and practically significant: there was no native OCP-MX-FP8 grouped MoE kernel for SM120/121 anywhere in the ecosystem. Not in FlashInfer. Not in vLLM. Not anywhere to compile. NVIDIA ships pre-built binaries for their datacenter Blackwell chips — the GB200, SM100 — but those are compiled artifacts. You cannot add an architecture target to a compiled binary. SM120/121 owners were simply not served. The hardware existed. The path didn't.

I had a choice to make. I could file a GitHub issue, add my voice to the long thread of SM12x owners noting the gap, and wait. That was the sensible option. I could see there were already thumbs-up reactions on the relevant issue. Another one wasn't going to move anything.

Or I could try to write the kernel.

I want to be honest about why most people in my position stop at option one. It's not laziness. It's accurate pattern recognition. You look at a CUDA kernel codebase — the C++ templates four abstraction layers deep, the architecture-specific dispatch tables, the collective operation specializations, the memory alignment constraints baked into fixed-point math — and you recognize that you are not equipped to do this work alone. That recognition is correct. I was not equipped to do this work alone.

But "if not me, who? If not now, when?" is a principle I actually live by. I had the hardware. The gap was real and documented. The community of SM120/121 owners — DGX Spark, RTX Pro 6000, RTX 5090 — who want near-lossless MXFP8 quality and native tensor core performance is growing. And I had something I hadn't had before when facing technical territory I didn't know: a senior engineer in the room.

The experiment I designed was this: I would role-play as a junior developer in a codebase entirely outside my expertise, using AI as my primary technical partner. Not as a rubber duck. Not as a code formatter. As the engineer who knows the domain and runs the investigation while I navigate the process.

What I brought to that partnership was not CUDA knowledge. It was 25 years of foundational understanding about how software works. At the most fundamental level, all software is data and the instructions that act on data. That is true whether you are writing a Ruby on Rails controller or a CUDA collective operation for a tensor core. If you understand that, you can follow an investigation even when you couldn't have initiated it. You can recognize when the approach is correct. You can ask the right questions and know when an answer doesn't make sense. That judgment is not domain-specific. It is more portable than most people think.

I also brought a discipline we agreed on from the start: no guessing. Every fix would be grounded in the exact error message, the working reference implementation, or the actual source code. If we didn't know why a change would work, we wouldn't make it. We would read until we knew.

Six builds. Six errors. Six fixes. That is the complete story of getting the kernel to compile and load.

Step	What we changed	What it revealed
Baseline	Nothing — unmodified build	First failure: missing submodule headers. After init: build succeeded in 410s, 90 objects. The CUTLASS stack compiled under CUDA 13.0 on SM121. Foundational risk retired.
1	Admitted `(e4m3, e4m3)` to the SM120 MoE specialization gate	Compiled and linked — then failed at load. The dispatcher referenced a launcher with no instantiation. Compiles ≠ loads.
2	Generated a plain (non-block-scaled) fp8×fp8 variant	Wrong collective: fell through to a Hopper code path, not the SM120 block-scaled one. Not an ISA wall — a schedule/collective mismatch.
3	Generated as block-scaled (`is_mx_fpx=True`)	Reached the correct SM120 collective. Two asserts: (a) weight type hardcoded to MXFP4; (b) stages < 2. Good errors — both named exactly what needed to change.
4	Changed weight type from `mx_float4_t` to `mx_float8_t`	The MXFP4 assertion disappeared. Only the stage count assertion remained.
5	Forced `StageCount<2>` for the SM120 W-MXFP8 collective	All asserts gone. Compiled — but undefined symbol at load: the plain fp8 dispatcher path compiled for SM120 but had no valid instantiation.
6	Guarded out plain fp8×fp8 on SM120 in the dispatcher	BUILD OK — 474 seconds, 90 objects, module loads cleanly. A native W-MXFP8 grouped MoE collective compiled and loaded on SM120/121.

Read that table carefully and you'll see the discipline. Nothing speculative. No "let me try a few things." The working dense SM120 kernel and the SM100 reference told us everything. The compiler told us what was wrong with each attempt. We read. We fixed exactly that. We built again.

I want to tell you about the incident, because it is part of the story.

Midway through this project, I launched a CUTLASS build on the DGX Spark without containing it in a memory-limited scope. The default compiler parallelism is around 20 simultaneous jobs. Each nvcc process consumes multiple gigabytes of memory. The DGX Spark uses unified memory — CPU and GPU share the same pool. Two production vLLM services were already running. The math resolved itself over about four hours of progressive memory exhaustion, a locked machine, and a physical power-cycle to recover.

After that: every compile-triggering command — including pytest, which JIT-compiles on first call — runs inside a systemd-run scope with a hard 90 GB ceiling, 8 jobs maximum, CPU throttled. The cgroup kills the build before it kills the box. The lesson I kept having to re-learn: guardrails exist precisely because attention lapses under the momentum of a long chase. The cgroup, not my discipline, is what makes a forgotten cap a contained annoyance instead of a four-hour outage.

Compiling the kernel was one gate. Making Gemma4 serve correct output through it was several more — and this is where "it works" became a phrase I trusted less and less with each false summit.

After the kernel compiled, loaded, and passed correctness tests, we wired it into vLLM's routing layer — the Python code that selects which backend handles each MoE forward pass. Then came the Docker image: eight sequential build failures, each invisible until the previous one fell. The host's glibc was newer than the container's — had to compile the kernel inside the image. No GPU available during docker build — had to compile via docker run --gpus all. Runtime image had no nvrtc.h header — mounted the host CUDA toolkit. Missing libnvrtc.so symlink — created it, relinked incrementally. None of these were foreseeable from the plan. Each was a precise, local fix once seen.

The model loaded. Health check returned 200. The logs said FLASHINFER_CUTLASS. The server was serving.

The logprobs came back as NaN.

Finite responses. Valid HTTP 200. Numerically meaningless output. This is the most convincing false summit in all of engineering: a system that is running, correctly shaped, correctly routed — and computing nonsense. The health check is not the test. A server that returns something tells you almost nothing.

The diagnostic move that mattered was changing the iteration loop. Debugging inside a Docker rebuild cycle — five minutes to rebuild the image, five minutes to reload a 26-billion-parameter model — means each hypothesis costs ten minutes minimum. We moved the test into a development environment running the kernel directly at the real expert shapes, comparing to a BF16 reference. Seconds per hypothesis instead of minutes.

Two runs settled three competing explanations. At an intermediate dimension of 512 — a clean multiple of 128 — the kernel gave a relative error of 0.067. Correct. At dimension 704 — Gemma4's actual intermediate size, which is 5.5 times 128, a partial tile — the error was 0.74. Wrong, but finite. That's a very specific kind of failure: the partial-tile case where the kernel's scale factor indexing doesn't land on a clean boundary.

Fix: pad the intermediate dimension from 704 to 768 before calling the kernel. Zero channels contribute nothing to the computation. The kernel sees a clean tile. The padding is transparent to the output.

The second fix was the activation scale factors, which needed to be in a specific swizzled layout. The third and final fix was the gate-and-up weight ordering. The FlashInfer kernel applies the GeGLU activation to the second half of the first linear projection's output. The Gemma4 checkpoint stores those two halves in the opposite order — gate first, up second. The TRTLLM code path had a swap operation to handle this. My wiring didn't. Every single forward pass was applying the gated activation to the wrong half of the computation.

I added the swap. Rebuilt the overlay. Started the model.

Capital of France?

Paris.

17 plus 25?

42.

MARLIN-quality output. On the native FP8 tensor core path.

Then we measured it. This is where the analytics become interesting — and where the tempting shortcut is tempting precisely because it would produce a flattering number.

The shortcut: benchmark W-MXFP8 against a BF16 cutlass kernel. If it wins, it wins, right? The problem is that BF16 loads 16-bit weights from memory, while both MARLIN and W-MXFP8 load 8-bit weights. At low batch sizes — the actual interactive serving regime — memory bandwidth dominates and the BF16 path is the slow one regardless of tensor cores. Beating BF16 at decode tells you nothing about whether you beat MARLIN at decode. You can manufacture a number that looks like a win and evaporates at the exact batch size you run in production. Recognizing this before running the benchmark was the single most valuable analytical step in the entire project.

The only honest comparison: same image, same model, same flags. Only the --moe-backend flag different.

Concurrency	flashinfer_cutlass (tok/s)	marlin (tok/s)	Speedup
1	30.1	31.6	0.95×
2	77.5	73.6	1.05×
4	131.0	119.0	1.10×
8	254.8	255.9	1.00×
16	493.3	367.5	1.34×
32	878.3	688.4	1.28×

Exactly what the analysis predicted. At low concurrency — memory-bandwidth-dominated, the live serving regime — the two paths are within noise. Both load 8-bit weights; neither can outrun the other on bandwidth. At high concurrency — compute-dominated — the native kernel runs on tcgen05 FP8 tensor cores while MARLIN dequants to 16-bit and runs BF16 math. 1.28 to 1.34 times faster where compute is the bottleneck.

Same near-lossless MXFP8 quality. Real throughput gains for the workloads where throughput matters most: prefill-heavy inference, batch processing, high-concurrency serving.

The pull request is at flashinfer-ai/flashinfer#3463. Eight files changed. Fifty-six lines added, seventeen removed. The commit message says "AI-assisted implementation" and lists Claude as co-author. I didn't hide that. It would be dishonest to hide it — and the dishonesty would be beside the point anyway.

The maintainers will have opinions. This is a codebase maintained by people who have thought about CUDA kernel architecture at a depth I will never reach in two days of work. They may want the partial-tile padding handled differently. They may have cleaner approaches to the scale factor layout. They may reject it entirely and write their own version. That is how open source is supposed to work. A pull request is a proposal, not a demand.

What I can say is that the proposal is technically grounded. It didn't emerge from speculation — it was built compiler-error by compiler-error, each change anchored in an existing working reference. The correctness tests pass on the actual hardware. The model produces correct output. The performance profile matches what the physics of the hardware predicts. If a more experienced engineer looks at this and says "I would have done it differently" — they are almost certainly right. "Differently" is not the same as "wrong."

Here is what I was actually testing.

The engineering was the method. The question underneath it was: what can someone genuinely out of their depth in a domain accomplish when AI is the senior in the room? Not "what can AI do" — I find that framing uninteresting. The more important question is what you can do when the expert is the partner, not the decision-maker.

I have had a front-row seat to the "AI slop" argument for two years. The argument is usually framed as a quality concern but it is, underneath, a gatekeeping concern: the idea that output produced with AI assistance doesn't count, that you have to earn the right to ship by having suffered through the credential path first. I understand the instinct. I felt it myself. I spent a long time learning things the hard way, and there is a part of the human brain that resents watching someone skip straight to the result.

But what bothers me — really bothers me — is what that argument says to the people it's aimed at. Most of the slop critique lands on junior engineers. People who entered the workforce at the exact moment these tools became available. People who did not have the option of spending a decade building the traditional calluses because by the time they showed up, the tools existed and there was no rational reason not to use them.

Think about it this way. A blacksmith develops calluses from swinging a hammer for years. Those calluses carry real knowledge — not just toughened skin but muscle memory, judgment about when the iron is ready, understanding of what the hammer's angle does to the shape. The calluses are not incidental. They contain information.

But the point was never the calluses. The point was the iron.

If a machine appeared that shaped iron just as well — or better — without requiring calluses, the blacksmith who refused to use it because calluses are how it's supposed to feel is not protecting craft. He is protecting his own biography. Those are different things, and pretending otherwise is self-serving.

What I learned from two days in a CUDA codebase I'd never seen before is that foundational knowledge matters enormously and specific domain knowledge can be borrowed. What cannot be borrowed is the judgment to know when an investigation is proceeding correctly — the pattern recognition built from having debugged a hundred failures in a hundred different ways, the sense of when a result that looks right is actually right, the instinct for when to push forward and when to stop and reread. That judgment is what made this project work. It's also what AI cannot fully replace, not yet and maybe not ever.

The junior engineers who are pair-programming with AI, shipping things, watching what fails and why — they are building that judgment right now. Faster, and through a different kind of exposure than we had. The people telling them it doesn't count are not protecting quality.

They are protecting their biographies. And biography is not craft.

If you are young in this industry — using AI as your primary tool, building a portfolio you're proud of, getting criticized for how you built it: I do not fault you. I do not think less of you. But be honest with yourself about what you are missing. Not the output. You're producing the output. What you're missing is the comprehension underneath it — the understanding of why the thing you built is right or wrong, not just that it ran. Use most of your time with these tools not to produce, but to understand. Ask questions. Interrogate the answers. Build the pattern library that makes you dangerous in ten years, not just productive today.

Learn.

And if you have been in this industry long enough to have strong opinions about how code should be written — if you earned those opinions the hard way and you know it: you stand on the shoulders of people who understood things at a lower level of abstraction than you do. That did not diminish you. Your experience is real. Your judgment is hard-won and it matters. None of that is threatened by someone using a tool you didn't have. There are people in front of you right now, trying to become excellent at the same profession you love. You can make that harder for them, or you can make it easier.

Teach.