Wan2.1-Fun-V1.1-1.3B-Control-Camera (Diffusers)

Converted from alibaba-pai/Wan2.1-Fun-V1.1-1.3B-Control-Camera (VideoX-Fun format) to HuggingFace diffusers format.

Self-contained repo with all weights + custom model code. The transformer uses a custom WanCameraControlTransformer3DModel class (included in this repo) that extends diffusers' WanTransformer3DModel with a camera control adapter.

Quick Start

import torch
from huggingface_hub import hf_hub_download

REPO = "the-sweater-cat/Wan2.1-Fun-V1.1-1.3B-Control-Camera-Diffusers"

# Download the custom model class and import it dynamically
import importlib.util, sys
spec = importlib.util.spec_from_file_location(
    "modeling_wan_camera",
    hf_hub_download(REPO, "modeling_wan_camera.py"))
mod = importlib.util.module_from_spec(spec)
sys.modules[spec.name] = mod  # register so later imports/pickling resolve the module
spec.loader.exec_module(mod)

# Load transformer with camera adapter
transformer = mod.WanCameraControlTransformer3DModel.from_pretrained(
    REPO, subfolder="transformer", torch_dtype=torch.bfloat16)

# Load other pipeline components
from diffusers import AutoencoderKLWan
from transformers import CLIPVisionModel, UMT5EncoderModel, AutoTokenizer

vae = AutoencoderKLWan.from_pretrained(REPO, subfolder="vae", torch_dtype=torch.bfloat16)
text_encoder = UMT5EncoderModel.from_pretrained(REPO, subfolder="text_encoder", torch_dtype=torch.bfloat16)
image_encoder = CLIPVisionModel.from_pretrained(REPO, subfolder="image_encoder", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(REPO, subfolder="tokenizer")

Or if you've cloned/downloaded the repo locally:

import sys, torch
sys.path.insert(0, "Wan2.1-Fun-V1.1-1.3B-Control-Camera-Diffusers")
from modeling_wan_camera import WanCameraControlTransformer3DModel

transformer = WanCameraControlTransformer3DModel.from_pretrained(
    "Wan2.1-Fun-V1.1-1.3B-Control-Camera-Diffusers/transformer",
    torch_dtype=torch.bfloat16)

Model Details

  • Architecture: WanTransformer3DModel + CameraControlAdapter
  • Parameters: 1.616B total (1.564B base + 51.9M camera adapter)
  • Precision: bfloat16
  • in_channels: 32 (16 noise + 16 image latents; camera enters via adapter, not channel concat)
  • Camera conditioning: 24-channel Plucker ray embeddings (6ch x 4 temporal packing) at pixel resolution
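As a rough illustration of the 24-channel layout (a hedged sketch, not VideoX-Fun's actual implementation; `plucker_rays` and all variable names here are made up), the 6 channels per frame are a Plucker ray per pixel (moment o × d plus unit direction d), and 4 consecutive frames are stacked along the channel axis:

```python
import numpy as np

def plucker_rays(K, c2w, H, W):
    """6-channel Plucker embedding [6, H, W] for one camera.
    K: 3x3 intrinsics, c2w: 4x4 camera-to-world extrinsic."""
    j, i = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Unproject pixel centers to camera-space ray directions
    dirs = np.stack([(i + 0.5 - K[0, 2]) / K[0, 0],
                     (j + 0.5 - K[1, 2]) / K[1, 1],
                     np.ones((H, W))], axis=-1)               # [H, W, 3]
    d = dirs @ c2w[:3, :3].T                                  # rotate to world
    d /= np.linalg.norm(d, axis=-1, keepdims=True)            # unit directions
    o = np.broadcast_to(c2w[:3, 3], d.shape)                  # ray origins
    m = np.cross(o, d)                                        # moment o x d
    return np.concatenate([m, d], axis=-1).transpose(2, 0, 1) # [6, H, W]

# 4-frame temporal packing -> 24 channels
K = np.array([[500., 0., 64.], [0., 500., 64.], [0., 0., 1.]])
frames = [plucker_rays(K, np.eye(4), 128, 128) for _ in range(4)]
packed = np.concatenate(frames, axis=0)
print(packed.shape)  # (24, 128, 128)
```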

Camera Control Architecture

Unlike the regular Control model (which concatenates control signals as extra input channels), the Camera model uses a lightweight CameraControlAdapter:

  1. PixelUnshuffle(8) -- spatial downscale from pixel to latent resolution
  2. Conv2d(1536, 1536, k=2, s=2) -- matches patch embedding stride
  3. ResidualBlock(1536) -- conv3x3 + ReLU + conv3x3 + skip

The adapter output is added to patch-embedded latents before the transformer blocks.
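The three steps above can be sketched as a small PyTorch module (an illustrative reimplementation from the description, not the code shipped in this repo):

```python
import torch
import torch.nn as nn

class CameraControlAdapterSketch(nn.Module):
    """Downscales pixel-resolution Plucker rays 16x spatially
    (8x from PixelUnshuffle, 2x from the strided conv) to match
    the patch-embedded token grid."""
    def __init__(self, dim=1536):
        super().__init__()
        self.unshuffle = nn.PixelUnshuffle(8)  # 24 ch -> 24 * 64 = 1536 ch
        self.proj = nn.Conv2d(dim, dim, kernel_size=2, stride=2)  # patch stride
        self.res = nn.Sequential(              # conv3x3 + ReLU + conv3x3
            nn.Conv2d(dim, dim, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1))

    def forward(self, camera):                 # [B, 24, 16h, 16w]
        x = self.proj(self.unshuffle(camera))  # [B, 1536, h, w]
        return x + self.res(x)                 # residual skip

adapter = CameraControlAdapterSketch()
out = adapter(torch.randn(1, 24, 64, 64))
print(out.shape)  # torch.Size([1, 1536, 4, 4])
```

With these layer shapes the sketch comes to about 51.9M parameters, consistent with the adapter size listed under Model Details.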

Transformer Forward Pass

output = transformer(
    hidden_states=latents,           # [B, 32, F, H, W] noise + image latents
    timestep=timestep,               # [B] diffusion timestep
    encoder_hidden_states=text_emb,  # [B, 512, 4096] text embeddings
    encoder_hidden_states_image=clip_emb,  # [B, 257, 1280] CLIP image tokens
    control_camera_video=camera_emb, # [B, 24, F, H*8, W*8] Plucker rays at pixel res
    return_dict=False,
)[0]

Camera trajectories (pan, zoom, rotate) are converted to Plucker embeddings using VideoX-Fun's process_pose_file() or ray_condition() utilities from camera extrinsic matrices.
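For intuition, a trajectory is just a sequence of per-frame extrinsic matrices. A minimal sketch (the `pan_trajectory` helper is hypothetical, not a VideoX-Fun API) of the kind of input those utilities consume:

```python
import numpy as np

def pan_trajectory(num_frames, step=0.05):
    """Camera-to-world 4x4 matrices for a simple rightward pan:
    identity rotation, translation growing along the world x-axis."""
    poses = []
    for f in range(num_frames):
        c2w = np.eye(4)
        c2w[0, 3] = f * step   # translate along world x-axis
        poses.append(c2w)
    return np.stack(poses)     # [num_frames, 4, 4]

poses = pan_trajectory(16)
print(poses.shape)  # (16, 4, 4)
```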

Conversion Verification

Forward-pass comparison against the original VideoX-Fun model in fp32:

  • Max absolute diff: 1.67e-6 (attention backend numerical noise)
  • allclose(atol=1e-2, rtol=1e-2): True
  • Parameter count: identical (1,616,313,152)

Verified both from local weights and from this HuggingFace repo.
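The comparison boils down to a few lines of the following shape (a sketch with random stand-in tensors; the actual verification script is not included in this repo):

```python
import numpy as np

# Stand-ins for the two models' fp32 outputs (same shape in practice)
ref = np.random.randn(1, 16, 4, 8, 8).astype(np.float32)
converted = ref + np.random.uniform(-1e-6, 1e-6, ref.shape).astype(np.float32)

max_abs_diff = np.abs(ref - converted).max()
ok = np.allclose(ref, converted, atol=1e-2, rtol=1e-2)
print(max_abs_diff < 1e-5, ok)  # True True
```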

Repo Contents

| File / Directory       | Description                               | Size   |
|------------------------|-------------------------------------------|--------|
| modeling_wan_camera.py | Custom model class (also in transformer/) | 6 KB   |
| transformer/           | Converted transformer weights + config    | 3.0 GB |
| text_encoder/          | UMT5-XXL text encoder                     | 21 GB  |
| image_encoder/         | CLIP ViT-H image encoder                  | 1.2 GB |
| vae/                   | Wan2.1 VAE                                | 485 MB |
| tokenizer/             | UMT5 tokenizer                            | 21 MB  |
| scheduler/             | UniPCMultistepScheduler config            | 1 KB   |
| image_processor/       | CLIPImageProcessor config                 | 1 KB   |
| model_index.json       | Pipeline component index                  | 1 KB   |

License

Apache 2.0 (same as the original model)
