Diffusers documentation
ErnieImageTransformer2DModel
You are viewing the main version, which requires installation from source. If you'd like a
regular pip install, check out the latest stable version (v0.38.0).
A Transformer model for image-like data from ERNIE-Image and ERNIE-Image-Turbo.
ErnieImageTransformer2DModel
class diffusers.ErnieImageTransformer2DModel(
    hidden_size: int = 3072,
    num_attention_heads: int = 24,
    num_layers: int = 24,
    ffn_hidden_size: int = 8192,
    in_channels: int = 128,
    out_channels: int = 128,
    patch_size: int = 1,
    text_in_dim: int = 2560,
    rope_theta: int = 256,
    rope_axes_dim: typing.Tuple[int, int, int] = (32, 48, 48),
    eps: float = 1e-06,
    qk_layernorm: bool = True
)
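The defaults in the signature above can be sanity-checked with a little arithmetic. The sketch below is only illustrative and rests on assumptions that this page does not state: that `hidden_size` is split evenly across `num_attention_heads`, that the three `rope_axes_dim` entries together cover one attention head, and that a latent of shape `(height, width)` is cut into `patch_size` x `patch_size` tiles with one token per tile. The helper `num_image_tokens` is hypothetical and not part of Diffusers.

```python
# Defaults copied from the constructor signature.
hidden_size = 3072
num_attention_heads = 24
rope_axes_dim = (32, 48, 48)
patch_size = 1

# Assumption: hidden_size is divided evenly across the attention heads.
head_dim = hidden_size // num_attention_heads

# Assumption: the rotary dims of the three axes together fill one head.
rope_total = sum(rope_axes_dim)


def num_image_tokens(height: int, width: int, patch: int = patch_size) -> int:
    """Hypothetical helper: assumed token count for a (height, width) latent."""
    if height % patch or width % patch:
        raise ValueError("height and width must be divisible by the patch size")
    return (height // patch) * (width // patch)


print(head_dim, rope_total, num_image_tokens(64, 64))
```

With the defaults, the per-head width and the sum of the rotary axis dims are both 128, which is consistent with the assumption above but does not prove how the model uses them internally.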
forward(
    hidden_states: Tensor,
    timestep: Tensor,
    text_bth: Tensor,
    text_lens: Tensor,
    return_dict: bool = True
)
Parameters
- hidden_states (torch.Tensor of shape (batch_size, in_channels, height, width)): Input hidden_states.
- timestep (torch.LongTensor): Used to indicate denoising step.
- text_bth (torch.Tensor): Conditional text embeddings (embeddings computed from the input conditions such as prompts) to use, shaped (batch_size, text_length, embed_dims).
- text_lens (torch.Tensor): Per-sample text sequence lengths used to build the attention mask.
- return_dict (bool, optional, defaults to True): Whether or not to return a ~models.transformer_2d.Transformer2DModelOutput instead of a plain tuple.
The ErnieImageTransformer2DModel forward method.
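To illustrate how per-sample lengths such as `text_lens` can be turned into a mask over the `text_length` axis of `text_bth`, here is a minimal, dependency-free sketch. It is not the model's actual implementation: `build_text_mask` is a hypothetical helper, it assumes prompts are right-padded, and it uses plain Python lists where the real model works with tensors.

```python
from typing import List


def build_text_mask(text_lens: List[int], text_length: int) -> List[List[bool]]:
    """Hypothetical helper: (batch_size, text_length) mask, True for real tokens.

    Assumes right-padding, so the first text_lens[i] positions of sample i are
    real tokens and the rest are padding.
    """
    for n in text_lens:
        if not 0 <= n <= text_length:
            raise ValueError(f"length {n} is outside [0, {text_length}]")
    return [[pos < n for pos in range(text_length)] for n in text_lens]


# Two prompts padded to 5 tokens; the first has 3 real tokens, the second 5.
mask = build_text_mask([3, 5], text_length=5)
for row in mask:
    print(row)
```

A mask of this kind lets attention ignore padded text positions, so that prompts of different lengths can share one batch.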