# Wan2.1-Fun-V1.1-1.3B-TI2V-Diffusers

## Model Description
This is a TI2V (Text+Image to Video) model derived from alibaba-pai/Wan2.1-Fun-V1.1-1.3B-Control-Camera (part of the VideoX-Fun project).
The camera control branch (SimpleAdapter) has been removed, leaving a clean image-to-video model that takes a reference image + text prompt and generates a video. All weights have been converted to the HuggingFace Diffusers format, so no custom model classes are needed.
Architecture: `WanTransformer3DModel` with `in_channels=32` (16-channel noisy latent + 16-channel image latent, no mask channel).
## Key Differences from Standard Wan 2.1 I2V
| Property | Standard Wan 2.1 I2V | This Model |
|---|---|---|
| `in_channels` | 36 (noisy 16 + mask 4 + masked_img 16) | 32 (noisy 16 + img 16, no mask) |
| `prepare_latents` | Encodes full video + mask | First-frame-only encode, no mask |
| `expand_timesteps` | True | False |
| Pipeline class | `WanImageToVideoPipeline` | `WanImageToVideoPipeline` (with patched `prepare_latents`) |
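As a quick sanity check, the channel budgets in the comparison add up as follows (plain arithmetic, no model required):

```python
# Transformer input channels, per the comparison table above.
STANDARD_I2V = {"noisy_latent": 16, "mask": 4, "masked_image_latent": 16}
THIS_MODEL = {"noisy_latent": 16, "image_latent": 16}

print(sum(STANDARD_I2V.values()))  # 36
print(sum(THIS_MODEL.values()))    # 32
```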
## Usage
The model loads with the standard `WanImageToVideoPipeline` from Diffusers. Two patches are required at runtime:

- `expand_timesteps=False` -- the Fun-V1.1 model family does not use expanded timesteps.
- Patched `prepare_latents` -- this model uses 32 input channels (no mask) and conditions only on the first-frame latent rather than encoding the full video through the VAE.
```python
import torch
import functools
from PIL import Image
from diffusers import WanImageToVideoPipeline
from diffusers.utils import export_to_video
from diffusers.utils.torch_utils import randn_tensor
from diffusers.pipelines.wan.pipeline_wan_i2v import retrieve_latents

MODEL_ID = "your-org/Wan2.1-Fun-V1.1-1.3B-TI2V-Diffusers"

pipe = WanImageToVideoPipeline.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
# pipe.config is a FrozenDict, so update it via register_to_config
pipe.register_to_config(expand_timesteps=False)

# Patch prepare_latents: this model uses 32ch (no mask), not 36ch
def patch_prepare_latents(pipe):
    @functools.wraps(pipe.prepare_latents)
    def prepare_latents_no_mask(
        image, batch_size, num_channels_latents=16,
        height=480, width=832, num_frames=81,
        dtype=None, device=None, generator=None, latents=None, last_image=None,
    ):
        vae = pipe.vae
        num_latent_frames = (num_frames - 1) // pipe.vae_scale_factor_temporal + 1
        latent_height = height // pipe.vae_scale_factor_spatial
        latent_width = width // pipe.vae_scale_factor_spatial
        shape = (batch_size, num_channels_latents, num_latent_frames, latent_height, latent_width)
        if latents is None:
            latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
        else:
            latents = latents.to(device, dtype)

        # Encode ONLY the first frame (not the full video)
        first_frame = image.unsqueeze(2).to(device=device, dtype=vae.dtype)
        with torch.no_grad():
            first_frame_latent = retrieve_latents(vae.encode(first_frame), sample_mode="argmax")

        # Normalize (required -- matches the VideoX-Fun pipeline behavior)
        latents_mean = torch.tensor(vae.config.latents_mean).view(1, 16, 1, 1, 1).to(first_frame_latent)
        latents_std = (1.0 / torch.tensor(vae.config.latents_std)).view(1, 16, 1, 1, 1).to(first_frame_latent)
        first_frame_latent = (first_frame_latent - latents_mean) * latents_std

        # Place in a zeros tensor -- only the first frame carries image conditioning
        latent_condition = torch.zeros(
            batch_size, 16, num_latent_frames, latent_height, latent_width,
            device=device, dtype=dtype,
        )
        latent_condition[:, :, :1] = first_frame_latent.to(dtype)
        return latents, latent_condition

    pipe.prepare_latents = prepare_latents_no_mask

patch_prepare_latents(pipe)
pipe.to("cuda")

# Generate video
image = Image.open("your_image.png").convert("RGB")
output = pipe(
    image=image,
    prompt="A person is speaking and listening.",
    negative_prompt="Blurry, static, low quality, worst quality",
    height=1280,  # 720p portrait
    width=720,
    num_frames=49,
    num_inference_steps=50,
    guidance_scale=6.0,
)
export_to_video(output.frames[0], "output.mp4", fps=16)
```
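For reference, the latent-grid arithmetic inside the patched `prepare_latents` can be sketched standalone. This assumes Wan's usual VAE scale factors (temporal 4, spatial 8) -- verify against `pipe.vae_scale_factor_temporal` / `pipe.vae_scale_factor_spatial` on your install:

```python
# Latent-grid arithmetic used by prepare_latents above, assuming the usual
# Wan VAE scale factors (temporal 4, spatial 8).
def latent_shape(num_frames, height, width, channels=16, t_scale=4, s_scale=8):
    num_latent_frames = (num_frames - 1) // t_scale + 1
    return (channels, num_latent_frames, height // s_scale, width // s_scale)

print(latent_shape(49, 1280, 720))  # (16, 13, 160, 90)
print(latent_shape(81, 480, 832))   # (16, 21, 60, 104)
```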
## Resolution Guidelines
| Resolution | Dimensions | Scheduler shift |
|---|---|---|
| 720p landscape | 1280x720 | 5.0 |
| 720p portrait | 720x1280 | 5.0 |
| 480p landscape | 832x480 | 3.0 |
| 480p portrait | 480x832 | 3.0 |
To change the scheduler shift:
```python
from diffusers import FlowMatchEulerDiscreteScheduler

pipe.scheduler = FlowMatchEulerDiscreteScheduler.from_config(
    pipe.scheduler.config, shift=3.0  # for 480p
)
```
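A small hypothetical helper (not part of the pipeline) that picks the shift from the table above based on the short side of the target resolution:

```python
# Hypothetical convenience helper: map a target resolution to the
# scheduler shift from the guidelines table (720p -> 5.0, 480p -> 3.0).
def shift_for_resolution(height, width):
    return 5.0 if min(height, width) >= 720 else 3.0

print(shift_for_resolution(832, 480))   # 3.0
print(shift_for_resolution(1280, 720))  # 5.0
```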
## Model Details
- Base model: alibaba-pai/Wan2.1-Fun-V1.1-1.3B-Control-Camera
- Architecture: WanTransformer3DModel (30 layers, 12 heads, head_dim=128)
- Parameters: ~1.3B (983 weight keys after adapter removal)
- Image encoder: CLIP ViT-Huge (CLIPVisionModel, 1280-dim)
- Text encoder: UMT5EncoderModel (4096-dim)
- VAE: AutoencoderKLWan
- Scheduler: FlowMatchEulerDiscreteScheduler
- Precision: bfloat16
## What Was Removed
The original camera-control model contains a SimpleAdapter module (6 weight keys) that processes camera parameters and adds them to the patch embedding output. This adapter has been removed entirely, leaving a clean TI2V model. The remaining 983 weight keys are unmodified.
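The removal amounts to filtering the state dict by key prefix. A minimal sketch, assuming the SimpleAdapter weights live under a `control_adapter.` prefix (the actual prefix in the original checkpoint may differ):

```python
# Sketch of adapter removal: drop all state-dict keys under a given prefix.
# "control_adapter." is an assumed prefix for illustration only.
def strip_adapter(state_dict, prefix="control_adapter."):
    return {k: v for k, v in state_dict.items() if not k.startswith(prefix)}

sd = {"patch_embedding.weight": 1, "control_adapter.conv.weight": 2}
print(sorted(strip_adapter(sd)))  # ['patch_embedding.weight']
```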
## Credits
- Original model: alibaba-pai/Wan2.1-Fun-V1.1-1.3B-Control-Camera (VideoX-Fun by alibaba-pai)
- Diffusers conversion: the-sweater-cat (WanCameraControlTransformer3DModel approach)
- TI2V extraction and pipeline: jypark