Anima
Anima is a text-to-image model that reuses the CosmosTransformer3DModel with a Qwen3 text encoder, a T5-token text conditioner, and the AutoencoderKLQwenImage VAE.
import torch
from diffusers import ModularPipeline
pipe = ModularPipeline.from_pretrained("circlestone-labs/Anima-Base-v1.0-Diffusers")
pipe.load_components(torch_dtype=torch.bfloat16)
pipe.to("cuda")
image = pipe(prompt="masterpiece, best quality, 1girl, solo, city lights").images[0]
AnimaModularPipeline
class diffusers.AnimaModularPipeline
( blocks: diffusers.modular_pipelines.modular_pipeline.ModularPipelineBlocks | None = None pretrained_model_name_or_path: str | os.PathLike | None = None components_manager: diffusers.modular_pipelines.components_manager.ComponentsManager | None = None collection: str | None = None modular_config_dict: dict[str, typing.Any] | None = None config_dict: dict[str, typing.Any] | None = None **kwargs )
A ModularPipeline for Anima.
> This is an experimental feature and is likely to change in the future.
AnimaAutoBlocks
Auto Modular pipeline for text-to-image generation using Anima.
Supported workflows:
text2image: requires prompt
Components:
text_encoder (Qwen3Model)
tokenizer (Qwen2Tokenizer)
t5_tokenizer (T5TokenizerFast)
text_conditioner (AnimaTextConditioner)
guider (ClassifierFreeGuidance)
transformer (CosmosTransformer3DModel)
scheduler (FlowMatchEulerDiscreteScheduler)
vae (AutoencoderKLQwenImage)
image_processor (VaeImageProcessor)
Inputs:
prompt (str):
The prompt or prompts to guide image generation.
negative_prompt (str, optional):
The prompt or prompts describing what the generated image should avoid.
max_sequence_length (int, optional, defaults to 512):
Maximum sequence length for prompt encoding.
num_images_per_prompt (int, optional, defaults to 1):
The number of images to generate per prompt.
height (int, optional):
The height in pixels of the generated image.
width (int, optional):
The width in pixels of the generated image.
latents (Tensor, optional):
Pre-generated noisy latents for image generation.
generator (Generator, optional):
Torch generator for deterministic generation.
num_inference_steps (int, optional, defaults to 50):
The number of denoising steps.
sigmas (list, optional):
Custom sigmas for the denoising process.
**denoiser_input_fields (None, optional):
The conditional model inputs for the Anima denoiser.
output_type (str, optional, defaults to pil):
Output format: ‘pil’, ‘np’, ‘pt’.
Outputs:
images (list):
Generated images.
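The inputs listed above are passed as keyword arguments when the pipeline is called. The following is a minimal sketch of collecting and checking them beforehand. `build_anima_inputs` and `run_anima` are hypothetical helper names, not part of diffusers; the negative prompt, the 1024x1024 size, and the step count are illustrative values only; and `run_anima` simply mirrors the quick-start snippet at the top of this page with a seeded generator added.

```python
ALLOWED_OUTPUT_TYPES = ("pil", "np", "pt")  # values listed under `output_type`


def build_anima_inputs(prompt, negative_prompt=None, height=None, width=None,
                       num_inference_steps=50, num_images_per_prompt=1,
                       max_sequence_length=512, output_type="pil"):
    """Hypothetical helper: gather documented inputs, omitting unset optional ones."""
    if not prompt:
        raise ValueError("`prompt` is required for the text2image workflow")
    if output_type not in ALLOWED_OUTPUT_TYPES:
        raise ValueError(f"output_type must be one of {ALLOWED_OUTPUT_TYPES}")
    inputs = {
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "height": height,
        "width": width,
        "num_inference_steps": num_inference_steps,
        "num_images_per_prompt": num_images_per_prompt,
        "max_sequence_length": max_sequence_length,
        "output_type": output_type,
    }
    # Leave out optional inputs that were not set so the pipeline defaults apply.
    return {name: value for name, value in inputs.items() if value is not None}


def run_anima(inputs, seed=0, repo_id="circlestone-labs/Anima-Base-v1.0-Diffusers"):
    """Not executed here: needs diffusers, torch, a CUDA device, and the checkpoint."""
    import torch
    from diffusers import ModularPipeline

    pipe = ModularPipeline.from_pretrained(repo_id)
    pipe.load_components(torch_dtype=torch.bfloat16)
    pipe.to("cuda")
    # A seeded generator makes repeated runs deterministic.
    generator = torch.Generator(device="cpu").manual_seed(seed)
    return pipe(**inputs, generator=generator).images


# Illustrative values; none of them are recommendations from the model authors.
inputs = build_anima_inputs(
    "masterpiece, best quality, 1girl, solo, city lights",
    negative_prompt="lowres, blurry",
    height=1024,
    width=1024,
    num_inference_steps=30,
)
```

Calling `run_anima(inputs)` on a machine with the model available would then return the generated images in the format selected by `output_type`.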
AnimaTextConditioner
class diffusers.AnimaTextConditioner
( source_dim: int = 1024 target_dim: int = 1024 model_dim: int = 1024 num_layers: int = 6 num_attention_heads: int = 16 mlp_ratio: float = 4.0 target_vocab_size: int = 32128 use_self_attention: bool = True use_layer_norm: bool = False min_sequence_length: int = 512 )
Text conditioner used by Anima to map Qwen3 hidden states and T5 token ids to Cosmos text embeddings.
Anima reuses the Cosmos Predict2 DiT. The only model-specific conditioning module is this LLM adapter, which
cross-attends from learned T5 token embeddings to Qwen3 text encoder hidden states before the diffusion loop.
target_dim is the conditioner output dimension and must match the transformer’s text_embed_dim.
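The dimension constraint above can be verified before the components are combined. This is a minimal sketch that assumes both configurations are available as plain dictionaries; `check_text_dims` is a hypothetical helper, not a diffusers function, and the numbers are illustrative.

```python
def check_text_dims(conditioner_config: dict, transformer_config: dict) -> None:
    """Raise if the conditioner output width differs from the transformer's text width."""
    target_dim = conditioner_config["target_dim"]
    text_embed_dim = transformer_config["text_embed_dim"]
    if target_dim != text_embed_dim:
        raise ValueError(
            f"target_dim={target_dim} does not match text_embed_dim={text_embed_dim}"
        )


# Illustrative values only; real ones come from the loaded components' configs.
check_text_dims({"target_dim": 1024}, {"text_embed_dim": 1024})  # passes silently
```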
forward
( source_hidden_states: Tensor target_input_ids: Tensor target_attention_mask: torch.Tensor | None = None source_attention_mask: torch.Tensor | None = None ) → torch.Tensor
Parameters
- source_hidden_states (torch.Tensor of shape (batch_size, source_sequence_length, source_dim)): Qwen3 text encoder hidden states to condition on.
- target_input_ids (torch.Tensor of shape (batch_size, target_sequence_length)): T5 token ids used as learned query tokens.
- target_attention_mask (torch.Tensor, optional): Attention mask for the target T5 token ids.
- source_attention_mask (torch.Tensor, optional): Attention mask for the source Qwen3 hidden states.
Returns
torch.Tensor
Text conditioning embeddings for the Cosmos transformer.
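To make the tensor shapes in the forward signature concrete, here is a toy single-layer PyTorch sketch of the idea described above: learned token embeddings act as queries that cross-attend to encoder hidden states. It is not the diffusers implementation. `ToyTextConditioner` and its small dimensions are hypothetical, it ignores the target attention mask, and it omits the self-attention, MLP, and multi-layer structure that the real constructor arguments suggest.

```python
import torch
from torch import nn


class ToyTextConditioner(nn.Module):
    """Illustrative only: NOT the diffusers AnimaTextConditioner."""

    def __init__(self, source_dim=32, target_dim=32, model_dim=32,
                 num_heads=4, vocab_size=100):
        super().__init__()
        # Learned embeddings for the target (T5-style) token ids; these are the queries.
        self.token_embedding = nn.Embedding(vocab_size, model_dim)
        # Project encoder hidden states into the attention width.
        self.source_proj = nn.Linear(source_dim, model_dim)
        self.cross_attn = nn.MultiheadAttention(model_dim, num_heads, batch_first=True)
        self.out_proj = nn.Linear(model_dim, target_dim)

    def forward(self, source_hidden_states, target_input_ids, source_attention_mask=None):
        queries = self.token_embedding(target_input_ids)
        memory = self.source_proj(source_hidden_states)
        key_padding_mask = None
        if source_attention_mask is not None:
            # nn.MultiheadAttention expects True where a key should be ignored.
            key_padding_mask = ~source_attention_mask.bool()
        attended, _ = self.cross_attn(
            queries, memory, memory, key_padding_mask=key_padding_mask
        )
        # Residual connection, then projection to the output width.
        return self.out_proj(queries + attended)


torch.manual_seed(0)
conditioner = ToyTextConditioner()
source_hidden_states = torch.randn(2, 7, 32)    # (batch, source_seq_len, source_dim)
target_input_ids = torch.randint(0, 100, (2, 5))  # (batch, target_seq_len)
source_attention_mask = torch.ones(2, 7, dtype=torch.long)
source_attention_mask[:, 5:] = 0                  # last two source positions are padding
embeddings = conditioner(source_hidden_states, target_input_ids, source_attention_mask)
```

The output has one embedding per target token, so its sequence length follows `target_input_ids` rather than the source sequence, and its last dimension is `target_dim`.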