Anima

Anima is a text-to-image model that reuses the CosmosTransformer3DModel with a Qwen3 text encoder, a T5-token text conditioner, and the AutoencoderKLQwenImage VAE.

import torch
from diffusers import ModularPipeline

pipe = ModularPipeline.from_pretrained("circlestone-labs/Anima-Base-v1.0-Diffusers")
pipe.load_components(torch_dtype=torch.bfloat16)
pipe.to("cuda")

image = pipe(prompt="masterpiece, best quality, 1girl, solo, city lights").images[0]

AnimaModularPipeline

class diffusers.AnimaModularPipeline

( blocks: diffusers.modular_pipelines.modular_pipeline.ModularPipelineBlocks | None = None, pretrained_model_name_or_path: str | os.PathLike | None = None, components_manager: diffusers.modular_pipelines.components_manager.ComponentsManager | None = None, collection: str | None = None, modular_config_dict: dict[str, typing.Any] | None = None, config_dict: dict[str, typing.Any] | None = None, **kwargs )

A ModularPipeline for Anima.

> This is an experimental feature and is likely to change in the future.

AnimaAutoBlocks

class diffusers.AnimaAutoBlocks

( )

Auto Modular pipeline for text-to-image generation using Anima.

Supported workflows:

text2image: requires prompt

Components:

text_encoder (Qwen3Model)
tokenizer (Qwen2Tokenizer)
t5_tokenizer (T5TokenizerFast)
text_conditioner (AnimaTextConditioner)
guider (ClassifierFreeGuidance)
transformer (CosmosTransformer3DModel)
scheduler (FlowMatchEulerDiscreteScheduler)
vae (AutoencoderKLQwenImage)
image_processor (VaeImageProcessor)

Inputs:

prompt (str): The prompt or prompts to guide image generation.
negative_prompt (str, optional): The prompt or prompts not to guide the image generation.
max_sequence_length (int, optional, defaults to 512): Maximum sequence length for prompt encoding.
num_images_per_prompt (int, optional, defaults to 1): The number of images to generate per prompt.
height (int, optional): The height in pixels of the generated image.
width (int, optional): The width in pixels of the generated image.
latents (Tensor, optional): Pre-generated noisy latents for image generation.
generator (Generator, optional): Torch generator for deterministic generation.
num_inference_steps (int, optional, defaults to 50): The number of denoising steps.
sigmas (list, optional): Custom sigmas for the denoising process.
**denoiser_input_fields (None, optional): The conditional model inputs for the Anima denoiser.
output_type (str, optional, defaults to pil): Output format: 'pil', 'np', 'pt'.

Outputs: images (list): Generated images.
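The documented inputs can be gathered into a keyword dictionary before the pipeline is called. The helper below is a hypothetical convenience: `build_anima_inputs` is not part of diffusers, and it only mirrors the defaults listed above and rejects output formats other than the three documented ones.

```python
from typing import Any, Optional

# Output formats listed in the Inputs section above.
SUPPORTED_OUTPUT_TYPES = ("pil", "np", "pt")


def build_anima_inputs(
    prompt: str,
    negative_prompt: Optional[str] = None,
    height: Optional[int] = None,
    width: Optional[int] = None,
    num_inference_steps: int = 50,
    num_images_per_prompt: int = 1,
    max_sequence_length: int = 512,
    output_type: str = "pil",
) -> dict[str, Any]:
    """Hypothetical helper: collect documented call inputs into one dict."""
    if output_type not in SUPPORTED_OUTPUT_TYPES:
        raise ValueError(
            f"output_type must be one of {SUPPORTED_OUTPUT_TYPES}, got {output_type!r}"
        )

    inputs: dict[str, Any] = {
        "prompt": prompt,
        "num_inference_steps": num_inference_steps,
        "num_images_per_prompt": num_images_per_prompt,
        "max_sequence_length": max_sequence_length,
        "output_type": output_type,
    }
    # Optional inputs are forwarded only when set, so the pipeline's own
    # defaults apply otherwise.
    optional = {"negative_prompt": negative_prompt, "height": height, "width": width}
    inputs.update({key: value for key, value in optional.items() if value is not None})
    return inputs
```

With a loaded pipeline, the result would be passed as `pipe(**inputs)`. Any height and width values chosen this way are the caller's responsibility; this page does not state which resolutions the model supports.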

AnimaTextConditioner

class diffusers.AnimaTextConditioner

( source_dim: int = 1024, target_dim: int = 1024, model_dim: int = 1024, num_layers: int = 6, num_attention_heads: int = 16, mlp_ratio: float = 4.0, target_vocab_size: int = 32128, use_self_attention: bool = True, use_layer_norm: bool = False, min_sequence_length: int = 512 )

Text conditioner used by Anima to map Qwen3 hidden states and T5 token ids to Cosmos text embeddings.

Anima reuses the Cosmos Predict2 DiT. The only model-specific conditioning module is this LLM adapter, which cross-attends from learned T5 token embeddings to Qwen3 text encoder hidden states before the diffusion loop. target_dim is the conditioner output dimension and must match the transformer’s text_embed_dim.
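To make the data flow concrete, here is a minimal sketch of one such cross-attention step in plain PyTorch. It is not the AnimaTextConditioner implementation: the class name `TinyConditionerSketch`, the single attention layer, the residual connection, and the small dimensions are all assumptions chosen for illustration. Judging by its signature, the real module stacks several layers and can also use self-attention and an MLP.

```python
import torch
from torch import nn


class TinyConditionerSketch(nn.Module):
    """Illustrative only: token-embedding queries cross-attend to encoder states."""

    def __init__(self, source_dim=48, target_dim=32, model_dim=64, num_heads=4, vocab_size=100):
        super().__init__()
        # Learned embeddings for the target token ids (the attention queries).
        self.token_embedding = nn.Embedding(vocab_size, model_dim)
        # Keys and values come from the source encoder, which may use a different width.
        self.cross_attention = nn.MultiheadAttention(
            embed_dim=model_dim,
            num_heads=num_heads,
            kdim=source_dim,
            vdim=source_dim,
            batch_first=True,
        )
        # Project to the width expected by the downstream transformer.
        self.out_proj = nn.Linear(model_dim, target_dim)

    def forward(self, source_hidden_states, target_input_ids, source_attention_mask=None):
        queries = self.token_embedding(target_input_ids)
        key_padding_mask = None
        if source_attention_mask is not None:
            # nn.MultiheadAttention expects True at positions to ignore.
            key_padding_mask = source_attention_mask == 0
        attended, _ = self.cross_attention(
            queries,
            source_hidden_states,
            source_hidden_states,
            key_padding_mask=key_padding_mask,
        )
        # Residual connection around attention, then output projection.
        return self.out_proj(queries + attended)


torch.manual_seed(0)
conditioner = TinyConditionerSketch()
source = torch.randn(2, 7, 48)              # (batch, source_sequence_length, source_dim)
token_ids = torch.randint(0, 100, (2, 5))   # (batch, target_sequence_length)
mask = torch.ones(2, 7, dtype=torch.long)
mask[1, 4:] = 0                             # second sample has three padded source positions
out = conditioner(source, token_ids, source_attention_mask=mask)
print(out.shape)  # (batch, target_sequence_length, target_dim) -> torch.Size([2, 5, 32])
```

The output has one vector per target token, with width target_dim, which is why target_dim has to agree with the text embedding width the transformer expects.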

forward

( source_hidden_states: Tensor, target_input_ids: Tensor, target_attention_mask: typing.Optional[torch.Tensor] = None, source_attention_mask: typing.Optional[torch.Tensor] = None ) → torch.Tensor

Parameters

source_hidden_states (torch.Tensor of shape (batch_size, source_sequence_length, source_dim)) — Qwen3 text encoder hidden states to condition on.
target_input_ids (torch.Tensor of shape (batch_size, target_sequence_length)) — T5 token ids used as learned query tokens.
target_attention_mask (torch.Tensor, optional) — Attention mask for the target T5 token ids.
source_attention_mask (torch.Tensor, optional) — Attention mask for the source Qwen3 hidden states.

Returns

torch.Tensor

Text conditioning embeddings for the Cosmos transformer.
