Wan2.1-Fun-V1.1-1.3B-TI2V-Diffusers

Model Description

This is a TI2V (Text+Image to Video) model derived from alibaba-pai/Wan2.1-Fun-V1.1-1.3B-Control-Camera (part of the VideoX-Fun project).

The camera control branch (SimpleAdapter) has been removed, leaving a clean image-to-video model that takes a reference image + text prompt and generates a video. All weights have been converted to the HuggingFace Diffusers format, so no custom model classes are needed.

Architecture: WanTransformer3DModel with in_channels=32 (16ch noisy latent + 16ch image latent, no mask channel).
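A minimal sketch of the channel layout (shapes are illustrative, assuming 832x480 at 81 frames with spatial downscale 8 and temporal downscale 4; the actual concatenation happens inside the pipeline call):

```python
import torch

# Illustrative latent shapes: (81 - 1) // 4 + 1 = 21 latent frames,
# 480 // 8 = 60, 832 // 8 = 104 spatial latent dims
noisy_latent = torch.randn(1, 16, 21, 60, 104)   # sample being denoised
image_latent = torch.zeros(1, 16, 21, 60, 104)   # first-frame conditioning, rest zeros

# The transformer input concatenates both along the channel axis -> 32 channels
transformer_input = torch.cat([noisy_latent, image_latent], dim=1)
print(transformer_input.shape)  # torch.Size([1, 32, 21, 60, 104])
```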

Key Differences from Standard Wan 2.1 I2V

| Property | Standard Wan 2.1 I2V | This Model |
| --- | --- | --- |
| in_channels | 36 (noisy 16 + mask 4 + masked_img 16) | 32 (noisy 16 + img 16, no mask) |
| prepare_latents | Encodes full video + mask | First-frame-only encode, no mask |
| expand_timesteps | True | False |
| Pipeline class | WanImageToVideoPipeline | WanImageToVideoPipeline (with patched prepare_latents) |

Usage

The model loads with the standard WanImageToVideoPipeline from Diffusers. Two patches are required at runtime:

  1. expand_timesteps=False -- the Fun-V1.1 model family does not use expanded timesteps.
  2. Patched prepare_latents -- this model uses 32 input channels (no mask), and conditions only on the first frame latent rather than encoding the full video through VAE.
```python
import torch
import functools
from PIL import Image
from diffusers import WanImageToVideoPipeline
from diffusers.utils import export_to_video
from diffusers.utils.torch_utils import randn_tensor
from diffusers.pipelines.wan.pipeline_wan_i2v import retrieve_latents

MODEL_ID = "your-org/Wan2.1-Fun-V1.1-1.3B-TI2V-Diffusers"

pipe = WanImageToVideoPipeline.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
# pipe.config is a FrozenDict, so item assignment raises; use register_to_config
pipe.register_to_config(expand_timesteps=False)

# Patch prepare_latents: this model uses 32ch (no mask), not 36ch
def patch_prepare_latents(pipe):
    @functools.wraps(pipe.prepare_latents)
    def prepare_latents_no_mask(
        image, batch_size, num_channels_latents=16,
        height=480, width=832, num_frames=81,
        dtype=None, device=None, generator=None, latents=None, last_image=None,
    ):
        vae = pipe.vae
        num_latent_frames = (num_frames - 1) // pipe.vae_scale_factor_temporal + 1
        latent_height = height // pipe.vae_scale_factor_spatial
        latent_width = width // pipe.vae_scale_factor_spatial

        shape = (batch_size, num_channels_latents, num_latent_frames, latent_height, latent_width)
        latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype) if latents is None else latents.to(device, dtype)

        # Encode ONLY the first frame (not the full video)
        first_frame = image.unsqueeze(2).to(device=device, dtype=vae.dtype)
        with torch.no_grad():
            first_frame_latent = retrieve_latents(vae.encode(first_frame), sample_mode="argmax")

        # Normalize (required -- matches the VideoX-Fun pipeline behavior)
        latents_mean = torch.tensor(vae.config.latents_mean).view(1, 16, 1, 1, 1).to(first_frame_latent)
        latents_std = (1.0 / torch.tensor(vae.config.latents_std)).view(1, 16, 1, 1, 1).to(first_frame_latent)
        first_frame_latent = (first_frame_latent - latents_mean) * latents_std

        # Place in a zeros tensor -- only the first frame carries image conditioning
        latent_condition = torch.zeros(
            batch_size, 16, num_latent_frames, latent_height, latent_width,
            device=device, dtype=dtype,
        )
        latent_condition[:, :, :1] = first_frame_latent.to(dtype)

        return latents, latent_condition

    pipe.prepare_latents = prepare_latents_no_mask

patch_prepare_latents(pipe)
pipe.to("cuda")

# Generate video
image = Image.open("your_image.png").convert("RGB")
output = pipe(
    image=image,
    prompt="A person is speaking and listening.",
    negative_prompt="Blurry, static, low quality, worst quality",
    height=1280,        # 720p portrait
    width=720,
    num_frames=49,
    num_inference_steps=50,
    guidance_scale=6.0,
)
export_to_video(output.frames[0], "output.mp4", fps=16)
```
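
A note on num_frames: the latent-frame formula (num_frames - 1) // 4 + 1 divides cleanly only when num_frames has the form 4k + 1 (e.g. 49 or 81). A small helper to check this, assuming a temporal VAE downscale of 4:

```python
def frames_ok(num_frames, temporal_factor=4):
    """True if num_frames maps cleanly onto latent frames (form 4k + 1)."""
    return (num_frames - 1) % temporal_factor == 0

# 48 and 80 do not divide cleanly; 49 and 81 do
print([n for n in (48, 49, 80, 81) if frames_ok(n)])  # [49, 81]
```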

Resolution Guidelines

| Resolution | Dimensions | Scheduler shift |
| --- | --- | --- |
| 720p landscape | 1280x720 | 5.0 |
| 720p portrait | 720x1280 | 5.0 |
| 480p landscape | 832x480 | 3.0 |
| 480p portrait | 480x832 | 3.0 |

To change the scheduler shift:

```python
from diffusers import FlowMatchEulerDiscreteScheduler

pipe.scheduler = FlowMatchEulerDiscreteScheduler.from_config(
    pipe.scheduler.config, shift=3.0  # for 480p
)
```

Model Details

  • Base model: alibaba-pai/Wan2.1-Fun-V1.1-1.3B-Control-Camera
  • Architecture: WanTransformer3DModel (30 layers, 12 heads, head_dim=128)
  • Parameters: ~1.3B (983 weight keys after adapter removal)
  • Image encoder: CLIP ViT-Huge (CLIPVisionModel, 1280-dim)
  • Text encoder: UMT5EncoderModel (4096-dim)
  • VAE: AutoencoderKLWan
  • Scheduler: FlowMatchEulerDiscreteScheduler
  • Precision: bfloat16

What Was Removed

The original camera-control model contains a SimpleAdapter module (6 weight keys) that processes camera parameters and adds them to the patch embedding output. This adapter has been removed entirely, leaving a clean TI2V model. The remaining 983 weight keys are unmodified.
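
A conversion along these lines can be sketched as a key filter over the original state dict. The prefix name below is hypothetical (the actual SimpleAdapter key names are not listed here), and the toy state dict is purely illustrative:

```python
import torch

def strip_adapter_keys(state_dict, prefix="control_adapter."):
    """Drop all weights under the (hypothetical) adapter prefix,
    keeping the remaining transformer weights untouched."""
    return {k: v for k, v in state_dict.items() if not k.startswith(prefix)}

# Toy example with made-up key names:
sd = {
    "patch_embedding.weight": torch.zeros(1),
    "control_adapter.conv.weight": torch.zeros(1),
    "control_adapter.conv.bias": torch.zeros(1),
}
clean = strip_adapter_keys(sd)
print(sorted(clean))  # ['patch_embedding.weight']
```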
