HobbyLM-Image: 1024px text-to-image DiT

The odd one out in the HobbyLM family: not a language model, but a 333M-parameter in-context flow-matching DiT that generates 1024×1024 images. It was built to see how good a text-to-image model you can train on a genuinely small budget: the whole thing came together for roughly $300 of Modal GPU time by working in a heavily compressed latent space instead of pixels.

It runs in the DC-AE f32c32 (SANA-1.1) latent space (32× spatial compression, so a 1024px image becomes a 32×32×32 latent) and is conditioned on CLIP-L text features, with classifier-free guidance.
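The latent shape follows directly from the compression factor. Here is a minimal sketch of that arithmetic, assuming a square RGB input; the helper names `latent_shape` and `compression_ratio` are illustrative and not part of this repo.

```python
def latent_shape(image_px: int, spatial_factor: int = 32, channels: int = 32):
    """Latent (height, width, channels) for a square image of side image_px."""
    if image_px % spatial_factor != 0:
        raise ValueError("image side must be divisible by the spatial factor")
    side = image_px // spatial_factor
    return (side, side, channels)


def compression_ratio(image_px: int, spatial_factor: int = 32, channels: int = 32) -> float:
    """Ratio of RGB pixel values to latent values for a square image."""
    h, w, c = latent_shape(image_px, spatial_factor, channels)
    return (image_px * image_px * 3) / (h * w * c)


print(latent_shape(1024))       # (32, 32, 32)
print(latent_shape(512))        # (16, 16, 32)
print(compression_ratio(1024))  # 96.0
```

At 1024px the DiT therefore works on 32,768 latent values per image instead of roughly 3.1 million pixel values.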

Intended use

Text-to-image generation at 1024×1024. Strongest on single objects and cinematic scenes. A sibling 512px checkpoint additionally does instruction-based image editing.

How it works

CLIP-L(prompt) ─┐
                ├─►  DiT  ──(rectified-flow / CFG sampler, ~100 steps)──►  latent  ──►  DC-AE decode  ──►  1024² image
Gaussian noise ─┘   (this repo)                                                         (frozen VAE)

The two frozen components are not included (download them from their own repos): mit-han-lab/dc-ae-f32c32-sana-1.1-diffusers (VAE) and openai/clip-vit-large-patch14 (text encoder). A full from-scratch CPU implementation of this pipeline (CLIP + DiT + DC-AE, in Rust) lives in hobby-rs.
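The sampling stage in the diagram can be sketched as a plain Euler integrator with classifier-free guidance. This is a toy illustration, not the repo's sampler: it assumes noise at t=0 and the clean latent at t=1, a uniform step schedule, and a hypothetical `velocity_fn(x, t, cond)` standing in for the DiT (the real model takes CLIP-L features and a 32×32×32 latent). The exact time convention and schedule used in training are not stated here.

```python
def cfg_velocity(v_cond, v_uncond, scale):
    """Classifier-free guidance: move from the unconditional velocity
    toward (and, for scale > 1, past) the conditional one."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def sample(velocity_fn, noise, cond, steps=100, cfg_scale=5.0):
    """Euler integration of dx/dt = v(x, t) from t=0 (noise) to t=1 (latent).

    velocity_fn is a hypothetical stand-in for the DiT; cond=None selects
    the unconditional (empty-prompt) branch. Both conventions are assumptions.
    """
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = i / steps
        v = cfg_velocity(velocity_fn(x, t, cond), velocity_fn(x, t, None), cfg_scale)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x, t, cond):
    """Toy model: velocity points straight at a fixed target
    (cond if given, 0.0 for the unconditional branch)."""
    target = 0.0 if cond is None else cond
    return [(target - xi) / (1.0 - t) for xi in x]


latent = sample(toy_velocity, noise=[0.3, -1.2, 0.8], cond=1.0)
print(latent)
```

With the toy model, every coordinate ends at approximately 5.0, that is, the unconditional target plus `cfg_scale` times the gap to the conditional target. This shows why a guidance scale above 1 pushes samples beyond the plain conditional prediction.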

Samples

1024×1024, generated by this model (CFG ≈ 5, ~100 steps):

[Image: HobbyLM-Image scene samples]

Results

This is a hobby-scale generator, so the honest "benchmark" is the training curve and qualitative behaviour rather than FID / GenEval (which we did not compute):

Property                     Value
Flow-matching loss (final)   0.76 (lowest of the model lineage; still decreasing)
Parameters                   333M (DiT only)
Resolution                   1024×1024 (32×32×32 latent)
VAE reconstruction           ~26 dB PSNR @512px; sharper at 1024px (32×32 latent)
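For readers unfamiliar with the PSNR figure, here is a minimal sketch of the standard definition, assuming pixel values in [0, 1] passed as flat sequences. How the ~26 dB figure was actually measured (data range, colour space, averaging) is not stated, so treat this only as an aid to interpretation.

```python
import math


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length flat pixel sequences."""
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)


def rms_error_for_psnr(db, max_value=1.0):
    """RMS pixel error that corresponds to a given PSNR."""
    return max_value * 10 ** (-db / 20.0)


print(round(rms_error_for_psnr(26.0), 3))  # 0.05
```

Under these assumptions, ~26 dB corresponds to an RMS reconstruction error of about 5% of the full pixel range.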

Qualitatively, the final checkpoint produces accurate objects and cinematic scenes. It is soft on people, hands, and multi-person scenes, which is the real small-model / latent-resolution ceiling. Loss was still dropping at the end of training, so the 333M DiT is not yet saturated.

Files

  • model.safetensors: the DiT weights.
  • config.json: DiT config, lat_std, and the VAE scaling_factor.

There is no GGUF build: image-generation DiTs have no standard GGUF runtime.

Limitations

  • Hands and multi-person scenes are unreliable.
  • Fine object crispness is capped by the 32× DC-AE latent; a less-compressed VAE would sharpen it at higher cost.
  • Instruction-based editing is limited (the CLIP-L text encoder is a weak instruction follower); the real fix is a stronger conditioner, which is future work.

License

Apache-2.0.
