# Wan2.1-Fun-V1.1-1.3B-Control-Camera (Diffusers)
Converted from `alibaba-pai/Wan2.1-Fun-V1.1-1.3B-Control-Camera` (VideoX-Fun format) to the Hugging Face Diffusers format.

Self-contained repo with all weights and custom model code. The transformer uses a custom `WanCameraControlTransformer3DModel` class (included in this repo) that extends Diffusers' `WanTransformer3DModel` with a camera control adapter.
## Quick Start

```python
import importlib.util
import sys

import torch
from huggingface_hub import hf_hub_download

REPO = "the-sweater-cat/Wan2.1-Fun-V1.1-1.3B-Control-Camera-Diffusers"

# Download the custom model class and import it
spec = importlib.util.spec_from_file_location(
    "modeling_wan_camera",
    hf_hub_download(REPO, "modeling_wan_camera.py"),
)
mod = importlib.util.module_from_spec(spec)
sys.modules["modeling_wan_camera"] = mod  # register so later re-imports resolve
spec.loader.exec_module(mod)

# Load transformer with camera adapter
transformer = mod.WanCameraControlTransformer3DModel.from_pretrained(
    REPO, subfolder="transformer", torch_dtype=torch.bfloat16
)

# Load other pipeline components
from diffusers import AutoencoderKLWan
from transformers import AutoTokenizer, CLIPVisionModel, UMT5EncoderModel

vae = AutoencoderKLWan.from_pretrained(REPO, subfolder="vae", torch_dtype=torch.bfloat16)
text_encoder = UMT5EncoderModel.from_pretrained(REPO, subfolder="text_encoder", torch_dtype=torch.bfloat16)
image_encoder = CLIPVisionModel.from_pretrained(REPO, subfolder="image_encoder", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(REPO, subfolder="tokenizer")
```
Or if you've cloned/downloaded the repo locally:
```python
import sys

import torch

sys.path.insert(0, "Wan2.1-Fun-V1.1-1.3B-Control-Camera-Diffusers")
from modeling_wan_camera import WanCameraControlTransformer3DModel

transformer = WanCameraControlTransformer3DModel.from_pretrained(
    "Wan2.1-Fun-V1.1-1.3B-Control-Camera-Diffusers/transformer",
    torch_dtype=torch.bfloat16,
)
```
## Model Details
- Architecture: WanTransformer3DModel + CameraControlAdapter
- Parameters: 1.616B total (1.564B base + 51.9M camera adapter)
- Precision: bfloat16
- `in_channels`: 32 (16 noise + 16 image latents; camera conditioning enters via the adapter, not channel concatenation)
- Camera conditioning: 24-channel Plucker ray embeddings (6 channels x 4-frame temporal packing) at pixel resolution
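The 24 channels come from packing the 6-channel-per-frame Plucker embedding along the channel axis in groups of 4 frames, matching the VAE's temporal compression. A minimal numpy sketch of that reshape (the exact frame grouping and first-frame handling follow VideoX-Fun's code; this version is an illustrative assumption):

```python
import numpy as np

def pack_plucker(plucker):
    """Pack per-frame 6-ch Plucker rays [F, 6, H, W] into
    4-frame groups of 24 channels -> [F // 4, 24, H, W]."""
    f, c, h, w = plucker.shape
    assert c == 6 and f % 4 == 0
    # Consecutive frames are contiguous, so a plain reshape stacks
    # frames 0..3 into channels 0..23 of the first packed slot, etc.
    return plucker.reshape(f // 4, 4 * c, h, w)

frames = np.random.randn(16, 6, 64, 64).astype(np.float32)
packed = pack_plucker(frames)
print(packed.shape)  # (4, 24, 64, 64)
```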
## Camera Control Architecture

Unlike the regular Control model (which concatenates control signals as extra input channels), the Camera model uses a lightweight `CameraControlAdapter`:

1. `PixelUnshuffle(8)` -- spatial downscale from pixel to latent resolution
2. `Conv2d(1536, 1536, k=2, s=2)` -- matches the patch embedding stride
3. `ResidualBlock(1536)` -- conv3x3 + ReLU + conv3x3 + skip
The adapter output is added to patch-embedded latents before the transformer blocks.
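The adapter stack described above can be sketched in PyTorch as follows (module names and the residual wiring here are assumptions based on this description; see `modeling_wan_camera.py` for the actual implementation):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """conv3x3 + ReLU + conv3x3 with a skip connection."""
    def __init__(self, dim):
        super().__init__()
        self.conv1 = nn.Conv2d(dim, dim, 3, padding=1)
        self.conv2 = nn.Conv2d(dim, dim, 3, padding=1)

    def forward(self, x):
        return x + self.conv2(torch.relu(self.conv1(x)))

class CameraControlAdapterSketch(nn.Module):
    """24-ch Plucker rays at pixel resolution -> 1536-ch features at patch resolution."""
    def __init__(self):
        super().__init__()
        self.unshuffle = nn.PixelUnshuffle(8)           # 24 ch * 8*8 = 1536 ch, 8x smaller spatially
        self.proj = nn.Conv2d(1536, 1536, 2, stride=2)  # matches the 2x2 patch embedding stride
        self.block = ResidualBlock(1536)

    def forward(self, x):
        return self.block(self.proj(self.unshuffle(x)))

adapter = CameraControlAdapterSketch()
rays = torch.randn(1, 24, 128, 128)  # [B, 24, H*8, W*8] for a 16x16 latent
features = adapter(rays)
print(features.shape)  # torch.Size([1, 1536, 8, 8])
```

Per frame, this maps pixel-resolution rays down to the same spatial grid the patch-embedded latents occupy, so the two can be summed elementwise.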
## Transformer Forward Pass

```python
output = transformer(
    hidden_states=latents,                 # [B, 32, F, H, W] noise + image latents
    timestep=timestep,                     # [B] diffusion timestep
    encoder_hidden_states=text_emb,        # [B, 512, 4096] text embeddings
    encoder_hidden_states_image=clip_emb,  # [B, 257, 1280] CLIP image tokens
    control_camera_video=camera_emb,       # [B, 24, F, H*8, W*8] Plucker rays at pixel res
    return_dict=False,
)[0]
```
Camera trajectories (pan, zoom, rotate) are converted from camera extrinsic matrices to Plucker embeddings using VideoX-Fun's `process_pose_file()` or `ray_condition()` utilities.
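As a rough sketch of what those utilities compute: each pixel is unprojected through the intrinsics into a world-space ray with direction d leaving the camera center o, and the 6-channel Plucker embedding stores (o x d, d). A minimal numpy version (the function name and conventions are illustrative assumptions, not VideoX-Fun's exact API):

```python
import numpy as np

def plucker_rays(K, c2w, h, w):
    """Per-pixel Plucker embedding [6, h, w] from intrinsics K (3x3)
    and a camera-to-world extrinsic c2w (4x4)."""
    # Pixel grid sampled at pixel centers
    j, i = np.meshgrid(np.arange(h) + 0.5, np.arange(w) + 0.5, indexing="ij")
    pix = np.stack([i, j, np.ones_like(i)], axis=-1)      # [h, w, 3] homogeneous pixels
    dirs = pix @ np.linalg.inv(K).T                       # camera-space ray directions
    dirs = dirs @ c2w[:3, :3].T                           # rotate into world space
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)  # unit directions d
    origin = c2w[:3, 3]                                   # camera center o
    moment = np.cross(np.broadcast_to(origin, dirs.shape), dirs)  # o x d
    return np.concatenate([moment, dirs], axis=-1).transpose(2, 0, 1)  # [6, h, w]

K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
emb = plucker_rays(K, np.eye(4), 64, 64)
print(emb.shape)  # (6, 64, 64)
```

Stacking one such map per video frame, then applying the 4-frame channel packing, produces the `[B, 24, F, H*8, W*8]` tensor the transformer expects.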
## Conversion Verification
Forward-pass comparison against the original VideoX-Fun model in fp32:
- Max absolute diff: 1.67e-6 (attention backend numerical noise)
- `allclose(atol=1e-2, rtol=1e-2)`: True
- Parameter count: identical (1,616,313,152)
Verified both from local weights and from this HuggingFace repo.
## Repo Contents

| File / Directory | Description | Size |
|---|---|---|
| `modeling_wan_camera.py` | Custom model class (also in `transformer/`) | 6 KB |
| `transformer/` | Converted transformer weights + config | 3.0 GB |
| `text_encoder/` | UMT5-XXL text encoder | 21 GB |
| `image_encoder/` | CLIP ViT-H image encoder | 1.2 GB |
| `vae/` | Wan2.1 VAE | 485 MB |
| `tokenizer/` | UMT5 tokenizer | 21 MB |
| `scheduler/` | UniPCMultistepScheduler config | 1 KB |
| `image_processor/` | CLIPImageProcessor config | 1 KB |
| `model_index.json` | Pipeline component index | 1 KB |
## License

Apache 2.0 (same as the original model)