Diffusers documentation

ZImageTransformer2DModel

A Transformer model for image-like data from Z-Image.

class diffusers.ZImageTransformer2DModel

( all_patch_size = (2,), all_f_patch_size = (1,), in_channels = 16, dim = 3840, n_layers = 30, n_refiner_layers = 2, n_heads = 30, n_kv_heads = 30, norm_eps = 1e-05, qk_norm = True, cap_feat_dim = 2560, siglip_feat_dim = None, rope_theta = 256.0, t_scale = 1000.0, axes_dims = [32, 48, 48], axes_lens = [1024, 512, 512] )
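
The default values imply a few derived quantities. The sketch below computes the per-head width from dim and n_heads and compares it with the sum of axes_dims. That the rotary axis dimensions are expected to add up to the head width, and that n_kv_heads follows the usual grouped-query convention, are assumptions made here for illustration; this page does not state either.

```python
# Default constructor values copied from the signature above.
dim = 3840
n_heads = 30
n_kv_heads = 30
axes_dims = [32, 48, 48]

# Per-head width: the model dimension split evenly across attention heads.
head_dim = dim // n_heads

# Assumption: the rotary axis dimensions partition the head width.
rope_dim = sum(axes_dims)

# Assumption: n_kv_heads is the number of key/value heads shared by the
# query heads, so equal counts mean one query head per key/value head.
heads_per_kv_head = n_heads // n_kv_heads

print(head_dim, rope_dim, heads_per_kv_head)  # 128 128 1
```

If you change dim, n_heads, or axes_dims, a check like this is a cheap way to spot an inconsistent configuration before building the model.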

forward

( x: list, t, cap_feats: list, return_dict: bool = True, controlnet_block_samples: dict[int, torch.Tensor] | None = None, siglip_feats: list[list[torch.Tensor]] | None = None, image_noise_mask: list[list[int]] | None = None, patch_size: int = 2, f_patch_size: int = 1 )

Parameters

  • x (list of torch.Tensor or nested list of torch.Tensor) — Input latents. A flat list when running in standard mode, or a nested list when running in omni mode.
  • t (torch.Tensor) — Timestep tensor indicating the current denoising step.
  • cap_feats (list of torch.Tensor or nested list of torch.Tensor) — Conditional caption embeddings, computed from the input conditions such as prompts.
  • return_dict (bool, optional, defaults to True) — Whether to return a Transformer2DModelOutput instead of a plain tuple.
  • controlnet_block_samples (dict of int to torch.Tensor, optional) — A mapping from block index to tensor. If specified, each tensor is added to the residual of the transformer block with that index.
  • siglip_feats (list of list of torch.Tensor, optional) — Optional SigLIP image features used as additional conditioning.
  • image_noise_mask (list of list of int, optional) — Per-image noise masks indicating noisy vs. clean tokens in omni mode.
  • patch_size (int, optional, defaults to 2) — Spatial patch size used to patchify the input latents.
  • f_patch_size (int, optional, defaults to 1) — Temporal patch size used to patchify the input latents.
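
As a rough illustration of the controlnet_block_samples description above, the sketch below adds a residual after selected blocks, keyed by block index. Plain Python floats stand in for tensors, and run_blocks and toy_blocks are hypothetical helpers that are not part of diffusers; the actual model may apply the residuals differently.

```python
from typing import Callable, Optional


def run_blocks(
    hidden: float,
    blocks: list[Callable[[float], float]],
    block_samples: Optional[dict[int, float]] = None,
) -> float:
    """Run blocks in order, adding block_samples[i] after block i if present."""
    for index, block in enumerate(blocks):
        hidden = block(hidden)
        # Only blocks whose index appears in the mapping receive a residual.
        if block_samples is not None and index in block_samples:
            hidden = hidden + block_samples[index]
    return hidden


# Three toy "blocks" that each add 1.0 to the hidden state.
toy_blocks = [lambda h: h + 1.0] * 3

print(run_blocks(0.0, toy_blocks))                     # 3.0
print(run_blocks(0.0, toy_blocks, {0: 0.5, 2: 0.25}))  # 3.75
```

Using a dict rather than a list lets a caller supply residuals for a sparse subset of blocks without padding the rest.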

The ZImageTransformer2DModel forward method.

Flow: patchify -> t_embed -> x_embed -> x_refine -> cap_embed -> cap_refine -> [siglip_embed -> siglip_refine] -> build_unified -> main_layers -> final_layer -> unpatchify
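
The flow above can be read as an ordered list of stages. The sketch below builds that list, treating the bracketed SigLIP stages as optional. That they run only when siglip_feats is supplied is an assumption based on the bracket notation, and forward_stages is a hypothetical helper, not a diffusers function.

```python
def forward_stages(use_siglip: bool) -> list[str]:
    """Return the stage names in order; SigLIP stages are included only on request."""
    stages = ["patchify", "t_embed", "x_embed", "x_refine", "cap_embed", "cap_refine"]
    if use_siglip:
        # Assumption: the bracketed stages are skipped without SigLIP features.
        stages += ["siglip_embed", "siglip_refine"]
    stages += ["build_unified", "main_layers", "final_layer", "unpatchify"]
    return stages


print(" -> ".join(forward_stages(use_siglip=False)))
```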

patchify_and_embed

( all_image: list, all_cap_feats: list, patch_size: int, f_patch_size: int )

Patchify for basic mode: single image per batch item.
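
To give a feel for what patch_size and f_patch_size control, the sketch below computes how many tokens a latent would produce and how wide each token would be. It assumes a latent laid out as (channels, frames, height, width) that is cut into non-overlapping patches; the layout, the divisibility check, and the patch_grid helper are illustrative assumptions, and the real method may pad or order dimensions differently.

```python
def patch_grid(channels, frames, height, width, patch_size=2, f_patch_size=1):
    """Return (num_tokens, token_dim) for a latent of shape (C, F, H, W)."""
    if frames % f_patch_size or height % patch_size or width % patch_size:
        raise ValueError("latent dimensions must be divisible by the patch sizes")
    # One token per non-overlapping patch along the frame, height, and width axes.
    num_tokens = (frames // f_patch_size) * (height // patch_size) * (width // patch_size)
    # Each token flattens every value inside its patch, across all channels.
    token_dim = channels * f_patch_size * patch_size * patch_size
    return num_tokens, token_dim


# A 16-channel, single-frame 64x64 latent with the default patch sizes.
print(patch_grid(16, 1, 64, 64))  # (1024, 64)
```

Doubling patch_size cuts the token count by four while making each token four times wider, which is the usual trade-off between sequence length and per-token size.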

patchify_and_embed_omni

( all_x: list, all_cap_feats: list, all_siglip_feats: list, patch_size: int, f_patch_size: int, images_noise_mask: list )

Patchify for omni mode: multiple images per batch item with noise masks.
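
In omni mode the inputs are nested: one inner list per batch item. The sketch below checks that a nested image list and a nested noise mask line up, assuming one mask entry per image. The check_omni_inputs helper is hypothetical, strings stand in for latent tensors, and the 0/1 mask values are placeholders because this page does not say which value marks a noisy image.

```python
def check_omni_inputs(all_x, images_noise_mask):
    """Check that every batch item has one mask entry per image; return image counts."""
    if len(all_x) != len(images_noise_mask):
        raise ValueError("all_x and images_noise_mask have different batch sizes")
    counts = []
    for item_index, (images, mask) in enumerate(zip(all_x, images_noise_mask)):
        if len(images) != len(mask):
            raise ValueError(
                f"batch item {item_index}: {len(images)} images, {len(mask)} mask entries"
            )
        counts.append(len(images))
    return counts


# Two batch items holding 2 images and 1 image respectively.
all_x = [["img_a", "img_b"], ["img_c"]]
images_noise_mask = [[0, 1], [1]]
print(check_omni_inputs(all_x, images_noise_mask))  # [2, 1]
```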