Okay, I may have been talking out of my ass about my scheduler using less VRAM than a full fine-tune (FFT). What I did find, though: training only ~30% of the model's weights per step consistently beat dense SFT on Hendrycks Math across 3 different seeds.
What makes it interesting isn't just the sparsity: no two consecutive windows share the same set of active layers, and adjacent layers are rarely trainable at the same time, so the model never has a stable input-to-output path it can build shortcuts through. I started developing this to reduce semantic redundancy across layers and stumbled onto something I didn't expect.
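For concreteness, the selection logic is roughly this (a simplified sketch, not the exact code; `pick_active_layers` and `apply_mask` are illustrative names, and the ~30% fraction and adjacency rule are as described above):

```python
import random
import torch.nn as nn

def pick_active_layers(num_layers: int, frac: float, prev_active: set[int]) -> set[int]:
    """Pick ~frac of the layers for this window, excluding the previous
    window's layers and avoiding adjacent pairs within the new pick."""
    k = max(1, int(num_layers * frac))
    candidates = [i for i in range(num_layers) if i not in prev_active]
    random.shuffle(candidates)
    active: set[int] = set()
    for i in candidates:
        # skip layers next to an already-selected one so neighbours
        # are rarely trainable in the same window
        if (i - 1) in active or (i + 1) in active:
            continue
        active.add(i)
        if len(active) == k:
            break
    return active

def apply_mask(layers: nn.ModuleList, active: set[int]) -> None:
    """Freeze everything except the active layers for this window."""
    for idx, layer in enumerate(layers):
        trainable = idx in active
        for p in layer.parameters():
            p.requires_grad_(trainable)
```

Call `pick_active_layers` / `apply_mask` once per window (e.g. from a per-step or per-N-steps hook) and the optimizer only ever sees gradients for the currently active slice of the stack.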
Setup: Qwen2.5-3B-Instruct, simplescaling/s1K (1k reasoning traces), 5 epochs, LR 1e-5, adamw_torch_fused optimizer, cosine LR schedule plus my lucky-pick layer scheduler, on an AMD MI300X (192 GB).
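In Hugging Face terms the run config is roughly the following (a sketch, not the exact script; `output_dir` is a placeholder, and the lucky-pick layer scheduler runs on top of this via a per-step hook rather than through any built-in argument):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="qwen2.5-3b-s1k-luckypick",  # placeholder path
    num_train_epochs=5,
    learning_rate=1e-5,
    optim="adamw_torch_fused",
    lr_scheduler_type="cosine",
)
```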