A 1B model that writes GPU kernels you can trust
I fine-tuned OpenBMB's MiniCPM5-1B to write Triton GPU kernels, then let an immutable referee decide if they are real: compile, check correctness against PyTorch on adversarial inputs, time against eager, torch.compile, and torch.compile max-autotune, then block the known ways of gaming the benchmark.
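To make the referee idea concrete, here is a minimal, framework-agnostic sketch of the correctness and timing stages. It is not the project's actual referee: the names `referee`, `is_close`, and `median_time` are hypothetical, plain Python lists stand in for GPU tensors, and the compile stage and the anti-gaming checks are left out.

```python
import math
import time


def is_close(a, b, rel_tol=1e-6, abs_tol=1e-9):
    """Compare two floats, treating NaN == NaN and matching infinities as equal."""
    if math.isnan(a) or math.isnan(b):
        return math.isnan(a) and math.isnan(b)
    return a == b or math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)


def median_time(fn, data, repeats=5):
    """Median wall-clock time of fn(data) over several repeats."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(data)
        samples.append(time.perf_counter() - start)
    samples.sort()
    return samples[len(samples) // 2]


def referee(candidate, reference, adversarial_inputs, baselines):
    """Hypothetical referee: reject on any crash or mismatch, then time.

    candidate / reference: callables mapping a list of floats to a list of floats.
    baselines: dict of name -> callable to time against (for example "eager").
    """
    # Stage 1: correctness against the reference on adversarial inputs.
    for x in adversarial_inputs:
        try:
            got = candidate(list(x))
        except Exception:
            return {"verdict": "rejected", "reason": "crash"}
        want = reference(list(x))
        same_length = len(got) == len(want)
        if not same_length or not all(is_close(g, w) for g, w in zip(got, want)):
            return {"verdict": "rejected", "reason": "mismatch"}

    # Stage 2: timing is only reached by candidates that passed stage 1.
    bench = [float(i) for i in range(10_000)]
    t_candidate = max(median_time(candidate, bench), 1e-12)
    speedups = {
        name: median_time(fn, bench) / t_candidate
        for name, fn in baselines.items()
    }
    return {"verdict": "verified", "speedups": speedups}
```

The design choice worth noting is the ordering: a candidate is never timed until it has matched the reference on every adversarial input, so a fast but wrong kernel cannot earn a speedup number.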
The 1B setup beat torch.compile max-autotune in 12/12 independently seeded runs. The larger Qwen3.6-27B smith pushed the same referee loop further: 76 verified compiler-beating kernels on H200, with 69 surviving a 5-run stability gate and 7 kept as single-shot probes on unseen problems. On a 376-cell shape/dtype grid, the stability-gated kernels keep a 1.49x geomean speedup; about 10% of cells are losses, and every cell is reported individually.
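For readers who want to see how a grid summary like this can be computed, here is a small sketch. The function names are hypothetical, and the stability rule is an assumption (a kernel passes only if it beats the baseline in every one of 5 runs); the write-up defines the actual gate.

```python
import math


def geomean(speedups):
    """Geometric mean of positive speedup ratios."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


def summarize_grid(cells):
    """Summarize a dict mapping (shape, dtype) -> speedup over the baseline."""
    values = list(cells.values())
    # A cell "loses" when the kernel is slower than the baseline (speedup < 1).
    losing = {cell: s for cell, s in cells.items() if s < 1.0}
    return {
        "geomean": geomean(values),
        "losing_fraction": len(losing) / len(values),
        "losing_cells": losing,  # kept per cell rather than hidden in the mean
    }


def passes_stability_gate(run_speedups, runs_required=5):
    """Assumed gate: enough runs, and every run beats the baseline."""
    return len(run_speedups) >= runs_required and all(s > 1.0 for s in run_speedups)
```

The geometric mean is the natural average for ratios: a 2x win and a 0.5x loss cancel to 1.0x instead of averaging to a misleading 1.25x.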
Honest bound: these are scheduling wins on memory-bound ops, not new algorithms or wins over cuBLAS/FlashAttention. The scarce resource is not the big model; it is a verifier the model cannot fool.
Full write-up: https://huggingface.co/blog/YMRohit/ouroboros-kernel-mint
Try it: build-small-hackathon/ouroboros-kernel-mint
2-min demo: https://youtu.be/ViicZHktb-A
Built for #BuildSmallHackathon with MiniCPM, Qwen, Triton, Gradio, Codex, and Modal H200s.