Wan2.1-Fun-V1.1-1.3B-TI2V-Diffusers

Model Description

This is a TI2V (Text+Image to Video) model derived from alibaba-pai/Wan2.1-Fun-V1.1-1.3B-Control-Camera (part of the VideoX-Fun project).

The camera control branch (SimpleAdapter) has been removed, leaving a clean image-to-video model that takes a reference image + text prompt and generates a video. All weights have been converted to the HuggingFace Diffusers format, so no custom model classes are needed.

Architecture: WanTransformer3DModel with in_channels=32 (16ch noisy latent + 16ch image latent, no mask channel).
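A minimal sketch of the channel layout (shapes are illustrative, assuming 832x480 at 81 frames with spatial downscale 8 and temporal downscale 4; the actual concatenation happens inside the pipeline call):

```python
import torch

# Illustrative latent shapes: (81 - 1) // 4 + 1 = 21 latent frames,
# 480 // 8 = 60, 832 // 8 = 104 spatial latent dims
noisy_latent = torch.randn(1, 16, 21, 60, 104)   # sample being denoised
image_latent = torch.zeros(1, 16, 21, 60, 104)   # first-frame conditioning, rest zeros

# The transformer input concatenates both along the channel axis -> 32 channels
transformer_input = torch.cat([noisy_latent, image_latent], dim=1)
print(transformer_input.shape)  # torch.Size([1, 32, 21, 60, 104])
```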

Key Differences from Standard Wan 2.1 I2V

| Property | Standard Wan 2.1 I2V | This Model |
| --- | --- | --- |
| in_channels | 36 (noisy 16 + mask 4 + masked_img 16) | 32 (noisy 16 + img 16, no mask) |
| prepare_latents | Encodes full video + mask | First-frame-only encode, no mask |
| expand_timesteps | True | False |
| Pipeline class | WanImageToVideoPipeline | WanImageToVideoPipeline (with patched prepare_latents) |

Usage

The model loads with the standard WanImageToVideoPipeline from Diffusers. Two patches are required at runtime:

  1. expand_timesteps=False -- the Fun-V1.1 model family does not use expanded timesteps.
  2. Patched prepare_latents -- this model uses 32 input channels (no mask), and conditions only on the first frame latent rather than encoding the full video through VAE.
```python
import torch
import functools
from PIL import Image
from diffusers import WanImageToVideoPipeline
from diffusers.utils import export_to_video
from diffusers.utils.torch_utils import randn_tensor
from diffusers.pipelines.wan.pipeline_wan_i2v import retrieve_latents

MODEL_ID = "your-org/Wan2.1-Fun-V1.1-1.3B-TI2V-Diffusers"

pipe = WanImageToVideoPipeline.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
# pipe.config is a FrozenDict, so item assignment raises; use register_to_config
pipe.register_to_config(expand_timesteps=False)

# Patch prepare_latents: this model uses 32ch (no mask), not 36ch
def patch_prepare_latents(pipe):
    @functools.wraps(pipe.prepare_latents)
    def prepare_latents_no_mask(
        image, batch_size, num_channels_latents=16,
        height=480, width=832, num_frames=81,
        dtype=None, device=None, generator=None, latents=None, last_image=None,
    ):
        vae = pipe.vae
        num_latent_frames = (num_frames - 1) // pipe.vae_scale_factor_temporal + 1
        latent_height = height // pipe.vae_scale_factor_spatial
        latent_width = width // pipe.vae_scale_factor_spatial

        shape = (batch_size, num_channels_latents, num_latent_frames, latent_height, latent_width)
        latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype) if latents is None else latents.to(device, dtype)

        # Encode ONLY the first frame (not the full video)
        first_frame = image.unsqueeze(2).to(device=device, dtype=vae.dtype)
        with torch.no_grad():
            first_frame_latent = retrieve_latents(vae.encode(first_frame), sample_mode="argmax")

        # Normalize (required -- matches the VideoX-Fun pipeline behavior)
        latents_mean = torch.tensor(vae.config.latents_mean).view(1, 16, 1, 1, 1).to(first_frame_latent)
        latents_std = (1.0 / torch.tensor(vae.config.latents_std)).view(1, 16, 1, 1, 1).to(first_frame_latent)
        first_frame_latent = (first_frame_latent - latents_mean) * latents_std

        # Place in a zeros tensor -- only the first frame carries image conditioning
        latent_condition = torch.zeros(
            batch_size, 16, num_latent_frames, latent_height, latent_width,
            device=device, dtype=dtype,
        )
        latent_condition[:, :, :1] = first_frame_latent.to(dtype)

        return latents, latent_condition

    pipe.prepare_latents = prepare_latents_no_mask

patch_prepare_latents(pipe)
pipe.to("cuda")

# Generate video
image = Image.open("your_image.png").convert("RGB")
output = pipe(
    image=image,
    prompt="A person is speaking and listening.",
    negative_prompt="Blurry, static, low quality, worst quality",
    height=1280,        # 720p portrait
    width=720,
    num_frames=49,
    num_inference_steps=50,
    guidance_scale=6.0,
)
export_to_video(output.frames[0], "output.mp4", fps=16)
```
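
A note on num_frames: the latent-frame formula (num_frames - 1) // 4 + 1 divides cleanly only when num_frames has the form 4k + 1 (e.g. 49 or 81). A small helper to check this, assuming a temporal VAE downscale of 4:

```python
def frames_ok(num_frames, temporal_factor=4):
    """True if num_frames maps cleanly onto latent frames (form 4k + 1)."""
    return (num_frames - 1) % temporal_factor == 0

# 48 and 80 do not divide cleanly; 49 and 81 do
print([n for n in (48, 49, 80, 81) if frames_ok(n)])  # [49, 81]
```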

Resolution Guidelines

| Resolution | Dimensions | Scheduler shift |
| --- | --- | --- |
| 720p landscape | 1280x720 | 5.0 |
| 720p portrait | 720x1280 | 5.0 |
| 480p landscape | 832x480 | 3.0 |
| 480p portrait | 480x832 | 3.0 |

To change the scheduler shift:

```python
from diffusers import FlowMatchEulerDiscreteScheduler

pipe.scheduler = FlowMatchEulerDiscreteScheduler.from_config(
    pipe.scheduler.config, shift=3.0  # for 480p
)
```

Model Details

  • Base model: alibaba-pai/Wan2.1-Fun-V1.1-1.3B-Control-Camera
  • Architecture: WanTransformer3DModel (30 layers, 12 heads, head_dim=128)
  • Parameters: ~1.3B (983 weight keys after adapter removal)
  • Image encoder: CLIP ViT-Huge (CLIPVisionModel, 1280-dim)
  • Text encoder: UMT5EncoderModel (4096-dim)
  • VAE: AutoencoderKLWan
  • Scheduler: FlowMatchEulerDiscreteScheduler
  • Precision: bfloat16

What Was Removed

The original camera-control model contains a SimpleAdapter module (6 weight keys) that processes camera parameters and adds them to the patch embedding output. This adapter has been removed entirely, leaving a clean TI2V model. The remaining 983 weight keys are unmodified.
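
A conversion along these lines can be sketched as a key filter over the original state dict. The prefix name below is hypothetical (the actual SimpleAdapter key names are not listed here), and the toy state dict is purely illustrative:

```python
import torch

def strip_adapter_keys(state_dict, prefix="control_adapter."):
    """Drop all weights under the (hypothetical) adapter prefix,
    keeping the remaining transformer weights untouched."""
    return {k: v for k, v in state_dict.items() if not k.startswith(prefix)}

# Toy example with made-up key names:
sd = {
    "patch_embedding.weight": torch.zeros(1),
    "control_adapter.conv.weight": torch.zeros(1),
    "control_adapter.conv.bias": torch.zeros(1),
}
clean = strip_adapter_keys(sd)
print(sorted(clean))  # ['patch_embedding.weight']
```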
