Latent Semantic Planning for Video Diffusion
Chenchen Liu*, Junyi Chen*, Lei Li*, Lu Chi*,§, Mingzhen Sun*, Zhuoying Li*, Yi Fu, Ruoyu Guo, Yiheng Wu, Ge Bai, Zehuan Yuan†
* Equal contribution † Corresponding author § Project lead
News
- [2026-06-01] We open-sourced the inference code and model weights of the Bernini Renderer (Bernini-R).
- [2026-05-22] We released our paper Bernini: Latent Semantic Planning for Video Diffusion.
Highlights
Bernini is a unified framework for video generation and editing that combines an MLLM-based semantic planner with a DiT-based renderer.
On video editing, Bernini ranks in the top tier alongside leading closed-source commercial models. The leaderboard below comes from our self-built arena platform, where human annotators vote blind on paired edits and the votes are aggregated into a Bradley-Terry score and a pairwise win-rate matrix.
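As a rough illustration of how blind pairwise votes can be aggregated into a Bradley-Terry score and a win-rate matrix, here is a minimal sketch. The vote counts are made up, and the fitting routine (the standard minorization-maximization update) is an assumption; the arena's actual aggregation code is not described in this README.

```python
def win_rate_matrix(wins):
    """wins[i][j] is the number of votes in which model i beat model j."""
    n = len(wins)
    rates = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = wins[i][j] + wins[j][i]
            if i != j and total > 0:
                rates[i][j] = wins[i][j] / total
    return rates


def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths with the classic MM update."""
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = n / sum(new_p)  # normalize so the mean strength is 1
        p = [x * scale for x in new_p]
    return p


# Hypothetical vote counts for three models (not real arena data).
wins = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]
scores = bradley_terry(wins)
rates = win_rate_matrix(wins)
```

A higher strength means a model is expected to win more of its pairings; `rates[i][j]` is simply the observed share of votes where model `i` was preferred over model `j`.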
Installation
Requirements
- Python 3.11.2.
- CUDA GPU: a Hopper GPU (H100/H800/H200) is recommended so FlashAttention-3 can be used; other CUDA GPUs fall back to FlashAttention-2 or PyTorch SDPA.
- CUDA toolkit 12.4 (matches the pinned torch==2.5.1+cu124; 12.3+ is the minimum if you build FlashAttention-3).
- Pinned in requirements.txt: torch==2.5.1+cu124, diffusers==0.35.2, accelerate==0.34.2, transformers==4.57.3.
Reference environment (Bernini-R is developed and tested on this setup):
| Component | Version |
|---|---|
| GPU | NVIDIA H100 |
| CUDA | 12.4 |
| Python | 3.11.2 |
| PyTorch | 2.5.1+cu124 |
Install
git clone https://github.com/bytedance/Bernini.git bernini && cd bernini
pip install -r requirements.txt
Optional extras:
- Multi-GPU sequence parallel needs Open-VeOmni (Apache-2.0, Python 3.11). Use --no-deps so VeOmni does not pull in a different torch build and override the pinned torch==2.5.1+cu124. Single-GPU inference does not need it.
  pip install --no-deps git+https://github.com/ByteDance-Seed/VeOmni.git@v0.1.10
- Faster attention (auto-detected if installed; otherwise PyTorch SDPA is used):
  - FlashAttention-2: general CUDA GPUs (incl. A100/A800).
    pip install flash-attn==2.8.3
  - FlashAttention-3: Hopper only (H100/H800/H200, CUDA ≥ 12.3, PyTorch ≥ 2.4). flash_attn_interface is not on PyPI; build it from the flash-attention repo's hopper/ directory at tag v2.8.3:
    git clone https://github.com/Dao-AILab/flash-attention.git
    cd flash-attention && git checkout v2.8.3
    cd hopper && MAX_JOBS=$(nproc) python3 setup.py install --user
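The fallback order described above (FlashAttention-3 on Hopper, then FlashAttention-2, then PyTorch SDPA) could be sketched as follows. This is not the repository's detection code: the function name, the `is_hopper` argument, and the returned labels are hypothetical. Only the module names `flash_attn_interface` and `flash_attn` come from the packages themselves.

```python
import importlib.util


def _is_installed(module_name):
    """Return True if the module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def pick_attention_backend(is_hopper, installed=_is_installed):
    """Pick the fastest available attention backend (hypothetical helper).

    `installed` is injectable so the selection logic can be checked
    without having any FlashAttention build on the machine.
    """
    if is_hopper and installed("flash_attn_interface"):
        return "flash-attention-3"
    if installed("flash_attn"):
        return "flash-attention-2"
    return "sdpa"


# On a machine with neither package installed this prints "sdpa".
print(pick_attention_backend(is_hopper=False))
```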
Weights
Bernini-R uses two sets of weights:
- Wan2.2 base: Wan-AI/Wan2.2-T2V-A14B-Diffusers on Hugging Face. Supplies the VAE, UMT5 text encoder, tokenizer, and the transformer architecture/base weights. It is downloaded automatically on first run (configured by wan22_base in configs/bernini_renderer_wan22/config.json).
- Bernini-R checkpoint: the trained high-noise / low-noise transformer weights (safetensors) from Hugging Face, passed with --high_noise_ckpt / --low_noise_ckpt. Both a local directory and a Hugging Face repo id are accepted.
Download the models with the hf CLI (installed with huggingface_hub):
pip install -U "huggingface_hub"
hf download Wan-AI/Wan2.2-T2V-A14B-Diffusers --local-dir Wan2.2-T2V-A14B-Diffusers
hf download ByteDance/Bernini --local-dir Bernini
Usage
A run is described by a case file β a small JSON under
assets/testcases/ that bundles one task's routing and
inputs (task_type, guidance_mode, prompt, source media, output). This
keeps long prompts out of the command line. Each task has a directory under
assets/testcases/ holding one or more case files; see
assets/testcases/ for the format and the bundled
t2i / i2i / t2v / v2v / rv2v / r2v examples.
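To make the idea of a case file concrete, here is a small sketch that writes and validates one. Only the field names task_type, guidance_mode, and prompt are taken from the description above; the `video` and `output` keys, the placeholder guidance_mode value, and the `load_case` helper are assumptions for illustration. Check the bundled files under assets/testcases/ for the real schema.

```python
import json
import tempfile
from pathlib import Path

# Task types mentioned in this README.
KNOWN_TASKS = {"t2i", "i2i", "t2v", "v2v", "mv2v", "rv2v", "r2v"}
REQUIRED_KEYS = ("task_type", "guidance_mode", "prompt")


def load_case(path):
    """Load a case file and check the fields named in the README (hypothetical helper)."""
    case = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if key not in case]
    if missing:
        raise ValueError(f"case file is missing keys: {missing}")
    if case["task_type"] not in KNOWN_TASKS:
        raise ValueError(f"unknown task_type: {case['task_type']!r}")
    return case


example = {
    "task_type": "v2v",
    "guidance_mode": "PLACEHOLDER",  # hypothetical value; copy one from a bundled case
    "prompt": "Add a snowman to the scene.",
    "video": "path/to/source.mp4",   # hypothetical key for the source media
    "output": "outputs/result.mp4",  # hypothetical key for the output path
}

with tempfile.TemporaryDirectory() as tmp:
    case_path = Path(tmp) / "example_case.json"
    case_path.write_text(json.dumps(example, indent=2), encoding="utf-8")
    loaded = load_case(case_path)
```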
Prompt enhancer (recommended)
--use_pe enhances the prompt through an OpenAI-compatible endpoint and is
recommended for best generation quality. The openai SDK is installed by
requirements.txt; configure the endpoint with environment variables:
export BERNINI_PE_API_KEY=... # or OPENAI_API_KEY
export BERNINI_PE_BASE_URL=... # or OPENAI_BASE_URL
export BERNINI_PE_MODEL=... # vision-capable chat model
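The "or" fallbacks in the comments above can be read as a simple lookup order. The sketch below assumes the BERNINI_PE_* variable wins when both are set; that precedence, the function name, and the returned dictionary layout are assumptions rather than documented behavior.

```python
import os


def resolve_pe_settings(env=None):
    """Resolve prompt-enhancer settings from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env

    def first(*names):
        # Return the first non-empty variable among the given names.
        for name in names:
            value = env.get(name)
            if value:
                return value
        return None

    return {
        "api_key": first("BERNINI_PE_API_KEY", "OPENAI_API_KEY"),
        "base_url": first("BERNINI_PE_BASE_URL", "OPENAI_BASE_URL"),
        "model": first("BERNINI_PE_MODEL"),
    }


settings = resolve_pe_settings()
if settings["api_key"] is None:
    print("No API key found; --use_pe would have no endpoint credentials.")
```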
Examples by task type
Unless an example specifies otherwise, inference outputs 480p / 16fps (the
defaults β --max_image_size 848, --fps 16).
Each example runs a bundled case in
assets/testcases/; replace <hi> / <lo> with your
high-/low-noise checkpoint paths. The image tasks (t2i, i2i) are shown on a
single GPU; the video tasks on 8 GPUs via torchrun, where --ulysses N gives
N-way Ulysses sequence parallel per sample and the resulting world_size / N
groups run data parallel over the task list. The two scripts take the same
inputs, so any example can be run either way.
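The split between sequence parallel and data parallel can be pictured with a small sketch. It assumes that consecutive ranks form one Ulysses group, which is a common layout but is not confirmed by this README; `rank_layout` is a hypothetical helper.

```python
def rank_layout(world_size, ulysses):
    """Map each rank to a data-parallel group and a sequence-parallel slot."""
    if ulysses < 1 or world_size % ulysses != 0:
        raise ValueError("world_size must be a positive multiple of --ulysses")
    return [
        {
            "rank": rank,
            "dp_group": rank // ulysses,  # which sample this rank works on
            "sp_rank": rank % ulysses,    # position inside the Ulysses group
        }
        for rank in range(world_size)
    ]


# 8 GPUs with --ulysses 8: one group, all ranks share a single sample.
single_group = rank_layout(8, 8)
# 8 GPUs with --ulysses 2: four groups, each handling its own sample.
four_groups = rank_layout(8, 2)
```

With --ulysses 8 on 8 GPUs there is a single group, so samples are processed one at a time; a smaller N trades per-sample speed for more samples in flight.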
Inputs can also be passed directly as flags instead of --case (--prompt,
--task_type, --guidance_mode, --video, --image, --images,
--output); generation parameters (--seed, --num_frames, ...) are always
command-line flags.
Text-to-image (t2i): single GPU; generates one frame, so pass --num_frames 1
python infer_single_gpu.py --high_noise_ckpt <hi> --low_noise_ckpt <lo> \
--case assets/testcases/t2i/t2i.json --num_frames 1
Image editing (i2i): single GPU; generates one frame, so pass --num_frames 1
python infer_single_gpu.py --high_noise_ckpt <hi> --low_noise_ckpt <lo> \
--case assets/testcases/i2i/i2i.json --num_frames 1
Text-to-video (t2v)
torchrun --nproc-per-node 8 infer_multi_gpu.py \
--high_noise_ckpt <hi> --low_noise_ckpt <lo> --ulysses 8 \
--case assets/testcases/t2v/t2v.json
Video editing (v2v / mv2v): two cases are provided.
For edits where the main subject keeps its ordinary motion (case 1 adds a
snowman to the scene), the v2v task type is enough:
torchrun --nproc-per-node 8 infer_multi_gpu.py \
--high_noise_ckpt <hi> --low_noise_ckpt <lo> --ulysses 8 \
--case assets/testcases/v2v/v2v_case1.json
For edits that need to change the subject's motion (case 2 makes the person
crouch down), the mv2v task type gives better results:
torchrun --nproc-per-node 8 infer_multi_gpu.py \
--high_noise_ckpt <hi> --low_noise_ckpt <lo> --ulysses 8 \
--case assets/testcases/v2v/v2v_case2.json
Reference + video editing (rv2v): two cases are provided.
Case 1 is reference-image-guided video editing, replacing a garment in the source video with one from a reference image:
torchrun --nproc-per-node 8 infer_multi_gpu.py \
--high_noise_ckpt <hi> --low_noise_ckpt <lo> --ulysses 8 \
--case assets/testcases/rv2v/rv2v_case1.json
Case 2 is a video-insertion example that inserts content into the source video. It is run at 720p / 24fps to show the insertion result more clearly:
torchrun --nproc-per-node 8 infer_multi_gpu.py \
--high_noise_ckpt <hi> --low_noise_ckpt <lo> --ulysses 8 \
--case assets/testcases/rv2v/rv2v_case2.json \
--num_frames 121 --fps 24 --max_image_size 1280
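For a sense of what these flags mean in practice, the clip length follows directly from the frame count and frame rate. The helper below is a hypothetical convenience for that arithmetic, not part of the repository.

```python
def clip_seconds(num_frames, fps):
    """Playback duration of a clip with the given frame count and frame rate."""
    if num_frames < 1 or fps <= 0:
        raise ValueError("num_frames must be >= 1 and fps must be > 0")
    return num_frames / fps


# --num_frames 121 --fps 24 gives a clip of roughly five seconds.
print(f"{clip_seconds(121, 24):.2f} s")  # prints "5.04 s"
```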
Reference-to-video (r2v): drives a video from one or more reference images
torchrun --nproc-per-node 8 infer_multi_gpu.py \
--high_noise_ckpt <hi> --low_noise_ckpt <lo> --ulysses 8 \
--case assets/testcases/r2v/r2v.json
See python infer_single_gpu.py --help for the full argument list.
Gradio demo
gradio_demo.py exposes the same pipeline through a Gradio UI: the task-type
dropdown auto-fills guidance_mode (still user-editable), uploaded media is
routed to the matching slot, and the result is rendered inline.
# Single GPU
python gradio_demo.py --high_noise_ckpt <hi> --low_noise_ckpt <lo> --port 7860
# 8 GPUs, 8-way Ulysses sequence parallel
torchrun --nproc-per-node 8 gradio_demo.py --ulysses 8 \
--high_noise_ckpt <hi> --low_noise_ckpt <lo> --port 7860 --share
Add --use_pe (and export OPENAI_API_KEY=... / BERNINI_PE_API_KEY=...) to
enable GPT prompt enhancement; the in-UI checkbox is a per-request switch on
top of this flag.
Citation
If you use Bernini in your research, please cite:
@article{bernini,
title = {Bernini: Latent Semantic Planning for Video Diffusion},
author = {Chenchen Liu and Junyi Chen and Lei Li and Lu Chi and Mingzhen Sun and Zhuoying Li and Yi Fu and Ruoyu Guo and Yiheng Wu and Ge Bai and Zehuan Yuan},
journal = {arXiv preprint arXiv:2605.22344},
year = {2026}
}
Acknowledgements
Bernini builds on several outstanding open-source projects:
We thank the authors and communities of these projects for their contributions.
License
Apache License 2.0. See LICENSE.