HobbyLM-Image: 1024px text-to-image DiT
The odd one out in the HobbyLM family: not a language model, but a 333M in-context flow-matching DiT that generates 1024×1024 images. It was built to see how good a text-to-image model you can train on a genuinely small budget: the whole thing came together for roughly $300 of Modal GPU time by working in a heavily compressed latent space instead of pixels.
It runs in the DC-AE f32c32 (SANA-1.1) latent (32× spatial compression → a 32×32×32 latent at 1024px) and is conditioned on CLIP-L text features, with classifier-free guidance.
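The latent-size arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the helper names `latent_shape` and `compression_ratio` are hypothetical, the channels-first ordering is an assumption, and the ratio simply counts stored values for an RGB image against its latent.

```python
def latent_shape(height, width, spatial_factor=32, channels=32):
    """Latent shape (channels, h, w) for an f32c32 autoencoder (assumed channels-first)."""
    if height % spatial_factor or width % spatial_factor:
        raise ValueError("image sides must be multiples of the spatial factor")
    return (channels, height // spatial_factor, width // spatial_factor)


def compression_ratio(height, width, spatial_factor=32, channels=32):
    """Number of RGB pixel values divided by the number of latent values."""
    c, h, w = latent_shape(height, width, spatial_factor, channels)
    return (3 * height * width) / (c * h * w)


print(latent_shape(1024, 1024))       # (32, 32, 32)
print(latent_shape(512, 512))         # (32, 16, 16)
print(compression_ratio(1024, 1024))  # 96.0
```

Under these assumptions the DiT works on 32,768 values per image instead of about 3.1 million, which is where most of the compute saving comes from.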
Intended use
Text-to-image generation at 1024×1024. Strongest on single objects and cinematic scenes. A sibling 512px checkpoint additionally does instruction-based image editing.
How it works
CLIP-L(prompt) ─┐
                ├─► DiT ─(rectified-flow / CFG sampler, ~100 steps)─► latent ─► DC-AE decode ─► 1024² image
Gaussian noise ─┘   (this repo)                                                 (frozen VAE)
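The sampling stage in the diagram can be sketched as a plain Euler integration with classifier-free guidance. This is an illustration under stated assumptions, not the repo's sampler: it assumes the convention that t = 1 is pure noise and t = 0 is data, that the network predicts the velocity (noise minus data), and it replaces the DiT with a hypothetical `toy_velocity` function so the loop runs without any weights.

```python
import numpy as np


def toy_velocity(x, t, target):
    # Hypothetical stand-in for the DiT: the exact velocity of a straight
    # path from `target` (at t = 0) to the current point x (at time t).
    return (x - target) / t


def cfg_velocity(x, t, cond_target, uncond_target, scale):
    # Classifier-free guidance: extrapolate from the unconditional
    # prediction towards the conditional one.
    v_cond = toy_velocity(x, t, cond_target)
    v_uncond = toy_velocity(x, t, uncond_target)
    return v_uncond + scale * (v_cond - v_uncond)


def sample(cond_target, uncond_target, steps=100, scale=5.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(cond_target.shape)  # start from Gaussian noise
    ts = np.linspace(1.0, 0.0, steps + 1)       # assumed: 1 = noise, 0 = data
    for t, t_next in zip(ts[:-1], ts[1:]):
        v = cfg_velocity(x, t, cond_target, uncond_target, scale)
        x = x - (t - t_next) * v                # Euler step towards t = 0
    return x


latent = sample(np.ones((32, 32, 32)), np.zeros((32, 32, 32)))
print(latent.shape)  # (32, 32, 32)
```

In the real pipeline the two calls inside `cfg_velocity` would be DiT forward passes with the CLIP-L features of the prompt and of an empty prompt, and the returned latent would go to the DC-AE decoder. The step count and guidance scale default to the values quoted for the samples (~100 steps, CFG ≈ 5).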
The two frozen components are not included (download them from their own repos):
mit-han-lab/dc-ae-f32c32-sana-1.1-diffusers (VAE) and openai/clip-vit-large-patch14 (text encoder).
A full from-scratch CPU implementation of this pipeline (CLIP + DiT + DC-AE, in Rust) lives in hobby-rs.
Samples
1024×1024, generated by this model (CFG ≈ 5, ~100 steps):
Results
This is a hobby-scale generator, so the honest "benchmark" is the training curve and qualitative behaviour rather than FID / GenEval (which we did not compute):
| Property | Value |
|---|---|
| Flow-matching loss (final) | 0.76 (lowest of the model lineage; still decreasing) |
| Parameters | 333M (DiT only) |
| Resolution | 1024×1024 (32×32×32 latent) |
| VAE reconstruction | ~26 dB PSNR @512px; sharper at 1024px (32×32 latent) |
Qualitatively, the final checkpoint produces accurate objects and cinematic scenes. It is soft on people, hands, and multi-person scenes, which is the real small-model / latent-resolution ceiling. Loss was still dropping at the end of training, so the 333M DiT is not yet saturated.
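For readers unfamiliar with the metric in the table, a flow-matching loss is commonly defined as the mean squared error between the predicted and the true velocity along a straight noise-to-data path. The sketch below shows that common definition only; the uniform time sampling, the t = 1 noise convention, and the `predict_velocity` callable are assumptions, and the training code for this model may differ in weighting or time schedule.

```python
import numpy as np


def flow_matching_loss(predict_velocity, x0, rng):
    """MSE between predicted and true velocity on a straight noise/data path.

    x0: batch of clean latents, shape (batch, ...).
    predict_velocity: hypothetical callable (x_t, t) -> velocity, same shape as x0.
    """
    noise = rng.standard_normal(x0.shape)
    # One time per sample, broadcast over the remaining axes (assumed uniform).
    t = rng.uniform(size=(x0.shape[0],) + (1,) * (x0.ndim - 1))
    x_t = (1.0 - t) * x0 + t * noise  # point on the straight path
    target = noise - x0               # constant velocity of that path
    pred = predict_velocity(x_t, t)
    return float(np.mean((pred - target) ** 2))


rng = np.random.default_rng(0)
latents = rng.standard_normal((8, 32, 32, 32))  # stand-in for normalized latents
print(flow_matching_loss(lambda x_t, t: np.zeros_like(x_t), latents, rng))
```

With unit-variance latents and this definition, a predictor that always outputs zero scores close to 2.0, which gives a rough sense of scale for a reported loss, provided the actual training loss follows the same convention.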
Files
- model.safetensors: the DiT weights.
- config.json: DiT config, lat_std, and the VAE scaling_factor.
There is no GGUF build: image-generation DiTs have no standard GGUF runtime.
Limitations
- Hands and multi-person scenes are unreliable.
- Fine object crispness is capped by the 32× DC-AE latent; a less-compressed VAE would sharpen it at higher cost.
- Instruction-based editing is limited (the CLIP-L text encoder is a weak instruction follower); the real fix is a stronger conditioner, which is future work.
License
Apache-2.0.
