Wan2.1-Fun-V1.1-1.3B-Control-Camera (Diffusers)

Converted from alibaba-pai/Wan2.1-Fun-V1.1-1.3B-Control-Camera (VideoX-Fun format) to HuggingFace diffusers format.

Self-contained repo with all weights + custom model code. The transformer uses a custom WanCameraControlTransformer3DModel class (included in this repo) that extends diffusers' WanTransformer3DModel with a camera control adapter.

Quick Start

import torch
from huggingface_hub import hf_hub_download

REPO = "the-sweater-cat/Wan2.1-Fun-V1.1-1.3B-Control-Camera-Diffusers"

# Download the custom model class and import it dynamically
import importlib.util, sys
spec = importlib.util.spec_from_file_location(
    "modeling_wan_camera",
    hf_hub_download(REPO, "modeling_wan_camera.py"))
mod = importlib.util.module_from_spec(spec)
sys.modules[spec.name] = mod  # register so later imports/pickling resolve the module
spec.loader.exec_module(mod)

# Load transformer with camera adapter
transformer = mod.WanCameraControlTransformer3DModel.from_pretrained(
    REPO, subfolder="transformer", torch_dtype=torch.bfloat16)

# Load other pipeline components
from diffusers import AutoencoderKLWan
from transformers import CLIPVisionModel, UMT5EncoderModel, AutoTokenizer

vae = AutoencoderKLWan.from_pretrained(REPO, subfolder="vae", torch_dtype=torch.bfloat16)
text_encoder = UMT5EncoderModel.from_pretrained(REPO, subfolder="text_encoder", torch_dtype=torch.bfloat16)
image_encoder = CLIPVisionModel.from_pretrained(REPO, subfolder="image_encoder", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(REPO, subfolder="tokenizer")

Or if you've cloned/downloaded the repo locally:

import sys, torch
sys.path.insert(0, "Wan2.1-Fun-V1.1-1.3B-Control-Camera-Diffusers")
from modeling_wan_camera import WanCameraControlTransformer3DModel

transformer = WanCameraControlTransformer3DModel.from_pretrained(
    "Wan2.1-Fun-V1.1-1.3B-Control-Camera-Diffusers/transformer",
    torch_dtype=torch.bfloat16)

Model Details

  • Architecture: WanTransformer3DModel + CameraControlAdapter
  • Parameters: 1.616B total (1.564B base + 51.9M camera adapter)
  • Precision: bfloat16
  • in_channels: 32 (16 noise + 16 image latents; camera enters via adapter, not channel concat)
  • Camera conditioning: 24-channel Plucker ray embeddings (6ch x 4 temporal packing) at pixel resolution
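As a rough illustration of the 24-channel layout (a hedged sketch, not VideoX-Fun's actual implementation; `plucker_rays` and all variable names here are made up), the 6 channels per frame are a Plucker ray per pixel (moment o × d plus unit direction d), and 4 consecutive frames are stacked along the channel axis:

```python
import numpy as np

def plucker_rays(K, c2w, H, W):
    """6-channel Plucker embedding [6, H, W] for one camera.
    K: 3x3 intrinsics, c2w: 4x4 camera-to-world extrinsic."""
    j, i = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Unproject pixel centers to camera-space ray directions
    dirs = np.stack([(i + 0.5 - K[0, 2]) / K[0, 0],
                     (j + 0.5 - K[1, 2]) / K[1, 1],
                     np.ones((H, W))], axis=-1)               # [H, W, 3]
    d = dirs @ c2w[:3, :3].T                                  # rotate to world
    d /= np.linalg.norm(d, axis=-1, keepdims=True)            # unit directions
    o = np.broadcast_to(c2w[:3, 3], d.shape)                  # ray origins
    m = np.cross(o, d)                                        # moment o x d
    return np.concatenate([m, d], axis=-1).transpose(2, 0, 1) # [6, H, W]

# 4-frame temporal packing -> 24 channels
K = np.array([[500., 0., 64.], [0., 500., 64.], [0., 0., 1.]])
frames = [plucker_rays(K, np.eye(4), 128, 128) for _ in range(4)]
packed = np.concatenate(frames, axis=0)
print(packed.shape)  # (24, 128, 128)
```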

Camera Control Architecture

Unlike the regular Control model (which concatenates control signals as extra input channels), the Camera model uses a lightweight CameraControlAdapter:

  1. PixelUnshuffle(8) -- spatial downscale from pixel to latent resolution
  2. Conv2d(1536, 1536, k=2, s=2) -- matches patch embedding stride
  3. ResidualBlock(1536) -- conv3x3 + ReLU + conv3x3 + skip

The adapter output is added to patch-embedded latents before the transformer blocks.
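The three steps above can be sketched as a small PyTorch module (an illustrative reimplementation from the description, not the code shipped in this repo):

```python
import torch
import torch.nn as nn

class CameraControlAdapterSketch(nn.Module):
    """Downscales pixel-resolution Plucker rays 16x spatially
    (8x from PixelUnshuffle, 2x from the strided conv) to match
    the patch-embedded token grid."""
    def __init__(self, dim=1536):
        super().__init__()
        self.unshuffle = nn.PixelUnshuffle(8)  # 24 ch -> 24 * 64 = 1536 ch
        self.proj = nn.Conv2d(dim, dim, kernel_size=2, stride=2)  # patch stride
        self.res = nn.Sequential(              # conv3x3 + ReLU + conv3x3
            nn.Conv2d(dim, dim, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1))

    def forward(self, camera):                 # [B, 24, 16h, 16w]
        x = self.proj(self.unshuffle(camera))  # [B, 1536, h, w]
        return x + self.res(x)                 # residual skip

adapter = CameraControlAdapterSketch()
out = adapter(torch.randn(1, 24, 64, 64))
print(out.shape)  # torch.Size([1, 1536, 4, 4])
```

With these layer shapes the sketch comes to about 51.9M parameters, consistent with the adapter size listed under Model Details.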

Transformer Forward Pass

output = transformer(
    hidden_states=latents,           # [B, 32, F, H, W] noise + image latents
    timestep=timestep,               # [B] diffusion timestep
    encoder_hidden_states=text_emb,  # [B, 512, 4096] text embeddings
    encoder_hidden_states_image=clip_emb,  # [B, 257, 1280] CLIP image tokens
    control_camera_video=camera_emb, # [B, 24, F, H*8, W*8] Plucker rays at pixel res
    return_dict=False,
)[0]

Camera trajectories (pan, zoom, rotate) are converted to Plucker embeddings using VideoX-Fun's process_pose_file() or ray_condition() utilities from camera extrinsic matrices.
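For intuition, a trajectory is just a sequence of per-frame extrinsic matrices. A minimal sketch (the `pan_trajectory` helper is hypothetical, not a VideoX-Fun API) of the kind of input those utilities consume:

```python
import numpy as np

def pan_trajectory(num_frames, step=0.05):
    """Camera-to-world 4x4 matrices for a simple rightward pan:
    identity rotation, translation growing along the world x-axis."""
    poses = []
    for f in range(num_frames):
        c2w = np.eye(4)
        c2w[0, 3] = f * step   # translate along world x-axis
        poses.append(c2w)
    return np.stack(poses)     # [num_frames, 4, 4]

poses = pan_trajectory(16)
print(poses.shape)  # (16, 4, 4)
```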

Conversion Verification

Forward-pass comparison against the original VideoX-Fun model in fp32:

  • Max absolute diff: 1.67e-6 (attention backend numerical noise)
  • allclose(atol=1e-2, rtol=1e-2): True
  • Parameter count: identical (1,616,313,152)

Verified both from local weights and from this HuggingFace repo.
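The comparison boils down to a few lines of the following shape (a sketch with random stand-in tensors; the actual verification script is not included in this repo):

```python
import numpy as np

# Stand-ins for the two models' fp32 outputs (same shape in practice)
ref = np.random.randn(1, 16, 4, 8, 8).astype(np.float32)
converted = ref + np.random.uniform(-1e-6, 1e-6, ref.shape).astype(np.float32)

max_abs_diff = np.abs(ref - converted).max()
ok = np.allclose(ref, converted, atol=1e-2, rtol=1e-2)
print(max_abs_diff < 1e-5, ok)  # True True
```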

Repo Contents

| File / Directory       | Description                               | Size   |
|------------------------|-------------------------------------------|--------|
| modeling_wan_camera.py | Custom model class (also in transformer/) | 6 KB   |
| transformer/           | Converted transformer weights + config    | 3.0 GB |
| text_encoder/          | UMT5-XXL text encoder                     | 21 GB  |
| image_encoder/         | CLIP ViT-H image encoder                  | 1.2 GB |
| vae/                   | Wan2.1 VAE                                | 485 MB |
| tokenizer/             | UMT5 tokenizer                            | 21 MB  |
| scheduler/             | UniPCMultistepScheduler config            | 1 KB   |
| image_processor/       | CLIPImageProcessor config                 | 1 KB   |
| model_index.json       | Pipeline component index                  | 1 KB   |

License

Apache 2.0 (same as the original model)
