Diffusers documentation

ErnieImageTransformer2DModel

You are viewing the main version, which requires installation from source. If you'd like a regular pip install, check out the latest stable version (v0.38.0).

A Transformer model for image-like data from ERNIE-Image and ERNIE-Image-Turbo.

ErnieImageTransformer2DModel

class diffusers.ErnieImageTransformer2DModel

( hidden_size: int = 3072, num_attention_heads: int = 24, num_layers: int = 24, ffn_hidden_size: int = 8192, in_channels: int = 128, out_channels: int = 128, patch_size: int = 1, text_in_dim: int = 2560, rope_theta: int = 256, rope_axes_dim: typing.Tuple[int, int, int] = (32, 48, 48), eps: float = 1e-06, qk_layernorm: bool = True )
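
The defaults in the signature above imply a per-head dimension that can be checked by hand. The short sketch below only does that arithmetic with values copied from the signature; it does not import diffusers. Reading the three `rope_axes_dim` entries as rotary sizes that together span one attention head is an assumption based on the numbers matching, not something this page states.

```python
# Default values copied from the constructor signature above.
hidden_size = 3072
num_attention_heads = 24
rope_axes_dim = (32, 48, 48)

# Per-head width of the attention projections.
head_dim = hidden_size // num_attention_heads

# Sum of the per-axis rotary sizes (assumed here to partition one head).
rope_dim_total = sum(rope_axes_dim)

print(head_dim)        # 128
print(rope_dim_total)  # 128
```

If you override `hidden_size`, `num_attention_heads`, or `rope_axes_dim`, a quick check like this may help confirm that the values still line up before instantiating the model.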

forward

( hidden_states: Tensor, timestep: Tensor, text_bth: Tensor, text_lens: Tensor, return_dict: bool = True )

Parameters

  • hidden_states (torch.Tensor of shape (batch_size, in_channels, height, width)): The input hidden states.
  • timestep (torch.LongTensor): The current denoising step.
  • text_bth (torch.Tensor of shape (batch_size, text_length, embed_dims)): Conditional text embeddings, computed from input conditions such as prompts.
  • text_lens (torch.Tensor): Per-sample text sequence lengths, used to build the attention mask.
  • return_dict (bool, optional, defaults to True): Whether to return a Transformer2DModelOutput instead of a plain tuple.
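
To illustrate how per-sample lengths such as `text_lens` can be turned into a mask over padded text positions, here is a minimal, framework-free sketch. The helper name `build_padding_mask` is hypothetical, and the convention shown (True for real tokens, False for padding) is an assumption; this page does not describe how the model builds its mask internally.

```python
def build_padding_mask(text_lens, max_len):
    """Return one row per sample: True for real tokens, False for padding.

    Hypothetical helper for illustration only, not part of diffusers.
    """
    return [[pos < length for pos in range(max_len)] for length in text_lens]


# Two prompts padded to a common text_length of 5:
# the first has 3 real tokens, the second has 5.
text_lens = [3, 5]
mask = build_padding_mask(text_lens, max_len=5)

print(mask[0])  # [True, True, True, False, False]
print(mask[1])  # [True, True, True, True, True]
```

In PyTorch the same comparison is commonly written with `torch.arange` and broadcasting against the length tensor, which avoids the Python loop.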

The ErnieImageTransformer2DModel forward method.
