Diffusers documentation

Anima

You are viewing the main version, which requires installation from source. If you'd like a regular pip install, check out the latest stable version (v0.38.0).
Anima is a text-to-image model that reuses the CosmosTransformer3DModel with a Qwen3 text encoder, a T5-token text conditioner, and the AutoencoderKLQwenImage VAE.

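import torch
from diffusers import ModularPipeline

# Build the modular pipeline definition from the Hub repository.
pipe = ModularPipeline.from_pretrained("circlestone-labs/Anima-Base-v1.0-Diffusers")
# Load the model weights in bfloat16, then move the pipeline to the GPU.
pipe.load_components(torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Generate one image from a tag-style prompt.
image = pipe(prompt="masterpiece, best quality, 1girl, solo, city lights").images[0]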

AnimaModularPipeline

class diffusers.AnimaModularPipeline

( blocks: diffusers.modular_pipelines.modular_pipeline.ModularPipelineBlocks | None = None, pretrained_model_name_or_path: str | os.PathLike | None = None, components_manager: diffusers.modular_pipelines.components_manager.ComponentsManager | None = None, collection: str | None = None, modular_config_dict: dict[str, typing.Any] | None = None, config_dict: dict[str, typing.Any] | None = None, **kwargs )

A ModularPipeline for Anima.

> This is an experimental feature and is likely to change in the future.

AnimaAutoBlocks

class diffusers.AnimaAutoBlocks

( )

Auto Modular pipeline for text-to-image generation using Anima.

Supported workflows:

  • text2image: requires prompt

Components:

  • text_encoder (Qwen3Model)
  • tokenizer (Qwen2Tokenizer)
  • t5_tokenizer (T5TokenizerFast)
  • text_conditioner (AnimaTextConditioner)
  • guider (ClassifierFreeGuidance)
  • transformer (CosmosTransformer3DModel)
  • scheduler (FlowMatchEulerDiscreteScheduler)
  • vae (AutoencoderKLQwenImage)
  • image_processor (VaeImageProcessor)

Inputs:

  • prompt (str): The prompt or prompts to guide image generation.
  • negative_prompt (str, optional): The prompt or prompts not to guide the image generation.
  • max_sequence_length (int, optional, defaults to 512): Maximum sequence length for prompt encoding.
  • num_images_per_prompt (int, optional, defaults to 1): The number of images to generate per prompt.
  • height (int, optional): The height in pixels of the generated image.
  • width (int, optional): The width in pixels of the generated image.
  • latents (Tensor, optional): Pre-generated noisy latents for image generation.
  • generator (Generator, optional): Torch generator for deterministic generation.
  • num_inference_steps (int, optional, defaults to 50): The number of denoising steps.
  • sigmas (list, optional): Custom sigmas for the denoising process.
  • **denoiser_input_fields (None, optional): The conditional model inputs for the Anima denoiser.
  • output_type (str, optional, defaults to pil): Output format: 'pil', 'np', or 'pt'.

Outputs: images (list): Generated images.
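
The documented inputs map onto keyword arguments of the pipeline call. The following is a minimal sketch, assuming `pipe` is an Anima pipeline loaded as in the example at the top of this page. `build_anima_inputs` and `generate` are hypothetical helper names that are not part of Diffusers, and the prompt, resolution, and step values are illustrative only.

```python
from typing import Any

# Input names copied from the "Inputs" list above. The helpers below are
# hypothetical and not part of Diffusers.
DOCUMENTED_INPUTS = {
    "prompt", "negative_prompt", "max_sequence_length", "num_images_per_prompt",
    "height", "width", "latents", "generator", "num_inference_steps",
    "sigmas", "output_type",
}


def build_anima_inputs(prompt: str, **overrides: Any) -> dict[str, Any]:
    """Collect call arguments, rejecting names the documentation does not list."""
    unknown = set(overrides) - DOCUMENTED_INPUTS
    if unknown:
        raise ValueError(f"Undocumented inputs: {sorted(unknown)}")
    # Drop None values so the pipeline falls back to its own defaults.
    inputs = {name: value for name, value in overrides.items() if value is not None}
    inputs["prompt"] = prompt
    return inputs


def generate(pipe, seed: int = 0):
    """Run text2image with a fixed seed. Assumes `pipe` is a loaded Anima pipeline."""
    import torch  # imported lazily so build_anima_inputs has no heavy dependencies

    inputs = build_anima_inputs(
        "masterpiece, best quality, 1girl, solo, city lights",
        negative_prompt="lowres, blurry",  # illustrative value
        height=1024,                       # illustrative value
        width=1024,                        # illustrative value
        num_inference_steps=30,            # the documented default is 50
        generator=torch.Generator().manual_seed(seed),
    )
    return pipe(**inputs).images[0]
```

Filtering against the documented names is a design choice for this sketch: it surfaces typos early instead of relying on how the pipeline handles unexpected keyword arguments.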

AnimaTextConditioner

class diffusers.AnimaTextConditioner

( source_dim: int = 1024, target_dim: int = 1024, model_dim: int = 1024, num_layers: int = 6, num_attention_heads: int = 16, mlp_ratio: float = 4.0, target_vocab_size: int = 32128, use_self_attention: bool = True, use_layer_norm: bool = False, min_sequence_length: int = 512 )

Text conditioner used by Anima to map Qwen3 hidden states and T5 token ids to Cosmos text embeddings.

Anima reuses the Cosmos Predict2 DiT. The only model-specific conditioning module is this LLM adapter, which cross-attends from learned T5 token embeddings to Qwen3 text encoder hidden states before the diffusion loop. target_dim is the conditioner output dimension and must match the transformer’s text_embed_dim.
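
The cross-attention idea described above can be illustrated with a toy module. This is a sketch only, not the actual AnimaTextConditioner implementation: `ToyTextConditioner` is a hypothetical name, it uses a single attention layer with small dimensions, and it leaves out the self-attention, MLP, normalization, and minimum-sequence-length handling that the real constructor arguments suggest.

```python
import torch
from torch import nn


class ToyTextConditioner(nn.Module):
    """Illustrative only: one cross-attention layer, not the real AnimaTextConditioner."""

    def __init__(self, source_dim=64, target_dim=64, model_dim=64,
                 num_attention_heads=4, target_vocab_size=100):
        super().__init__()
        # Learned embeddings for the target (T5) token ids; these become the queries.
        self.token_embed = nn.Embedding(target_vocab_size, model_dim)
        # Project the source (Qwen3) hidden states into the model dimension.
        self.source_proj = nn.Linear(source_dim, model_dim)
        self.cross_attn = nn.MultiheadAttention(
            model_dim, num_attention_heads, batch_first=True
        )
        self.out_proj = nn.Linear(model_dim, target_dim)

    def forward(self, source_hidden_states, target_input_ids, source_attention_mask=None):
        queries = self.token_embed(target_input_ids)
        memory = self.source_proj(source_hidden_states)
        key_padding_mask = None
        if source_attention_mask is not None:
            # nn.MultiheadAttention expects True at positions to ignore.
            key_padding_mask = ~source_attention_mask.bool()
        attended, _ = self.cross_attn(
            queries, memory, memory, key_padding_mask=key_padding_mask
        )
        # Residual connection, then project to the output dimension.
        return self.out_proj(queries + attended)


torch.manual_seed(0)
conditioner = ToyTextConditioner()
source = torch.randn(2, 7, 64)            # (batch_size, source_sequence_length, source_dim)
ids = torch.randint(0, 100, (2, 5))       # (batch_size, target_sequence_length)
mask = torch.ones(2, 7, dtype=torch.long)
out = conditioner(source, ids, source_attention_mask=mask)
print(out.shape)  # torch.Size([2, 5, 64])
```

In this toy version the output has one vector per target token, with the last dimension equal to `target_dim`, which is the dimension the text says must match the transformer's text_embed_dim.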

forward

( source_hidden_states: Tensor, target_input_ids: Tensor, target_attention_mask: torch.Tensor | None = None, source_attention_mask: torch.Tensor | None = None ) → torch.Tensor

Parameters

  • source_hidden_states (torch.Tensor of shape (batch_size, source_sequence_length, source_dim)) — Qwen3 text encoder hidden states to condition on.
  • target_input_ids (torch.Tensor of shape (batch_size, target_sequence_length)) — T5 token ids used as learned query tokens.
  • target_attention_mask (torch.Tensor, optional) — Attention mask for the target T5 token ids.
  • source_attention_mask (torch.Tensor, optional) — Attention mask for the source Qwen3 hidden states.

Returns

torch.Tensor

Text conditioning embeddings for the Cosmos transformer.
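
To make the documented shapes concrete, the sketch below builds random tensors under the keyword names that `forward` accepts. `make_conditioner_inputs` is a hypothetical helper, not part of Diffusers. The mask layout, `(batch_size, sequence_length)` with 1 marking real tokens, is an assumption, since the parameter list above does not specify it.

```python
import torch


def make_conditioner_inputs(batch_size=2, source_len=12, target_len=8,
                            source_dim=1024, target_vocab_size=32128):
    """Build random tensors with the documented shapes (hypothetical helper)."""
    return {
        # Stand-in for Qwen3 hidden states: (batch_size, source_sequence_length, source_dim).
        "source_hidden_states": torch.randn(batch_size, source_len, source_dim),
        # Stand-in for T5 token ids: (batch_size, target_sequence_length).
        "target_input_ids": torch.randint(0, target_vocab_size, (batch_size, target_len)),
        # Assumption: masks are (batch_size, sequence_length) with 1 = real token.
        "target_attention_mask": torch.ones(batch_size, target_len, dtype=torch.long),
        "source_attention_mask": torch.ones(batch_size, source_len, dtype=torch.long),
    }


inputs = make_conditioner_inputs()
for name, tensor in inputs.items():
    print(name, tuple(tensor.shape), tensor.dtype)
```

With an AnimaTextConditioner instance whose `source_dim` and `target_vocab_size` match these values, the dictionary could be passed as `conditioner(**inputs)`.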
