Rohit Yelukati Mahendra PRO

YMRohit

Recent Activity

published an article about 15 hours ago

The Method Behind OUROBOROS: Make the Reward Hard to Fool

updated a model 1 day ago

YMRohit/ouroboros-kernelsmith-minicpm5-1b

posted an update 2 days ago

A 1B model that writes GPU kernels you can trust

Posts 1

A 1B model that writes GPU kernels you can trust

I fine-tuned OpenBMB's MiniCPM5-1B to write Triton GPU kernels, then let an immutable referee decide if they are real: compile, check correctness against PyTorch on adversarial inputs, time against eager, torch.compile, and torch.compile max-autotune, then block the known ways of gaming the benchmark.
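The referee steps above (run the candidate, compare it with a trusted reference on adversarial inputs, then time it) can be sketched in a few lines. This is a minimal illustration, not the OUROBOROS referee: it assumes plain Python lists stand in for tensors and a pure-Python softmax stands in for the PyTorch reference, and the names `referee`, `matches`, `median_time`, and `ADVERSARIAL_INPUTS` are hypothetical.

```python
import math
import time

def reference_softmax(xs):
    """Trusted reference (hypothetical stand-in for the PyTorch result)."""
    if not xs:
        return []
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def naive_softmax(xs):
    """Candidate that skips max subtraction and overflows on large inputs."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Inputs chosen to break shortcuts: empty, single element, huge and tiny magnitudes.
ADVERSARIAL_INPUTS = [
    [],
    [0.0],
    [1000.0, 1000.0],
    [-1000.0, 0.0],
    [1e-30, -1e-30, 3.5],
]
TIMING_INPUT = [float(i % 17) for i in range(2000)]

def matches(got, want, rtol=1e-6, atol=1e-9):
    """Element-wise closeness check; NaN never compares as close."""
    return len(got) == len(want) and all(
        math.isclose(g, w, rel_tol=rtol, abs_tol=atol) for g, w in zip(got, want)
    )

def median_time(fn, xs, repeats=5):
    """Median wall-clock time of fn(xs) over several repeats."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(xs)
        samples.append(time.perf_counter() - start)
    samples.sort()
    return samples[len(samples) // 2]

def referee(candidate, reference, adversarial_inputs, timing_input):
    """Reject on any crash or mismatch; only time candidates that are correct."""
    for xs in adversarial_inputs:
        want = reference(list(xs))
        try:
            # Pass a copy so the candidate cannot mutate the referee's data.
            ok = matches(candidate(list(xs)), want)
        except Exception:
            ok = False
        if not ok:
            return {"correct": False, "speedup": None}
    base = median_time(reference, timing_input)
    cand = median_time(candidate, timing_input)
    return {"correct": True, "speedup": base / max(cand, 1e-12)}

if __name__ == "__main__":
    print(referee(naive_softmax, reference_softmax, ADVERSARIAL_INPUTS, TIMING_INPUT))
```

The ordering is the point of the sketch: a candidate is never timed until it has passed every correctness check, so a fast but wrong kernel cannot earn a speedup.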

The 1B setup beat torch.compile max-autotune in 12/12 independently seeded runs. The larger Qwen3.6-27B smith pushed the same referee loop further: 76 verified compiler-beating kernels on H200, with 69 surviving a 5-run stability gate and 7 kept as single-shot probes on unseen problems. On a 376-cell shape/dtype grid, the stability-gated kernels keep a 1.49x geomean; about 10% of cells lose, and every cell is reported individually.
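For readers unfamiliar with the two summary statistics, here is a minimal sketch of a geometric-mean speedup over a shape/dtype grid and of a repeated-run gate. The gate rule used here (all 5 runs must beat the baseline) is an assumption, since the post does not spell out the exact criterion, and the grid values are made-up illustrative numbers, not results from the post. All function names are hypothetical.

```python
import math

def geomean(values):
    """Geometric mean: exp of the mean of logs, the usual average for ratios."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def passes_stability_gate(speedups, runs_required=5, threshold=1.0):
    """Assumed rule: enough runs were made and every one beats the baseline."""
    return len(speedups) >= runs_required and all(s > threshold for s in speedups)

def summarize_grid(cell_speedups):
    """Report the geomean plus every losing cell instead of hiding it in the average."""
    values = list(cell_speedups.values())
    losing = {cell: s for cell, s in cell_speedups.items() if s < 1.0}
    return {
        "geomean": geomean(values),
        "losing_fraction": len(losing) / len(values),
        "losing_cells": losing,
    }

if __name__ == "__main__":
    # Illustrative (shape, dtype) -> speedup values only.
    grid = {
        ("1024x1024", "fp16"): 2.0,
        ("1024x1024", "fp32"): 1.5,
        ("4096x4096", "fp16"): 1.2,
        ("4096x4096", "fp32"): 0.8,
    }
    print(summarize_grid(grid))
    print(passes_stability_gate([1.2, 1.1, 1.3, 1.05, 1.4]))
```

A geometric mean is used because speedups are ratios: a 2x win and a 0.5x loss average to 1.0x, which an arithmetic mean would overstate as 1.25x.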

Honest bound: these are scheduling wins on memory-bound ops, not new algorithms or wins over cuBLAS/FlashAttention. The scarce thing is not the big model; it is the verifier it cannot fool.

Full write-up: https://huggingface.co/blog/YMRohit/ouroboros-kernel-mint
Try it: https://huggingface.co/spaces/build-small-hackathon/ouroboros-kernel-mint
2-min demo: https://youtu.be/ViicZHktb-A

Built for #BuildSmallHackathon with MiniCPM, Qwen, Triton, Gradio, Codex, and Modal H200s.