CogVideoX-2b CLEAR LoRA: Subtitle Removal (Supplementary)

This repository releases LoRA + expanded input-projection weights for video-to-video subtitle removal on top of zai-org/CogVideoX-2b.

Disclaimer: This is a supplementary experiment from the CLEAR project. The main paper results use Wan2.1-Control; this CogVideoX-2b variant is not expected to match that baseline. It is shared for reproducibility and comparison.

Architecture change (high level)

CogVideoX-2b is originally a text-to-video model. To condition it on the subtitled video, the input channels of the patch-embedding convolution are doubled:

  • Before: patch_embed.proj: Conv2d(16 → 1920, …)
  • After: patch_embed.proj: Conv2d(32 → 1920, …)
    • First 16 channels: noisy latent (inherits pretrained weights)
    • Last 16 channels: subtitle-video latent (new channels, trained)

At inference, the noisy latent and the subtitle latent are concatenated along the channel dimension before entering the transformer, as in training.
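
The channel expansion and concatenation described above can be sketched in PyTorch as follows. This is a minimal illustration, not the project's training code: `expand_input_conv` is a hypothetical helper, the kernel size and stride of the stand-in layer are placeholders (the source elides them), zero-initializing the new channels is an assumption the source does not state, and the latents are shown as 4D per-frame tensors for brevity.

```python
import torch
import torch.nn as nn


def expand_input_conv(conv: nn.Conv2d, new_in_channels: int) -> nn.Conv2d:
    """Return a copy of `conv` that accepts `new_in_channels` input channels.

    The original input channels keep their pretrained weights; the extra
    channels are zero-initialized (an assumption made for this sketch).
    """
    old_in = conv.in_channels
    if new_in_channels < old_in:
        raise ValueError("new_in_channels must be >= the original in_channels")

    expanded = nn.Conv2d(
        new_in_channels,
        conv.out_channels,
        kernel_size=conv.kernel_size,
        stride=conv.stride,
        padding=conv.padding,
        bias=conv.bias is not None,
    )
    with torch.no_grad():
        expanded.weight.zero_()                         # new channels start at zero
        expanded.weight[:, :old_in].copy_(conv.weight)  # inherit pretrained weights
        if conv.bias is not None:
            expanded.bias.copy_(conv.bias)
    return expanded


# Stand-in for patch_embed.proj. kernel_size=2 and stride=2 are placeholders.
old_proj = nn.Conv2d(16, 1920, kernel_size=2, stride=2)
new_proj = expand_input_conv(old_proj, 32)

# Dummy latents: the first 16 channels are the noisy latent, the last 16 the
# latent of the subtitled video. Shapes here are illustrative only.
noisy_latent = torch.randn(1, 16, 8, 8)
subtitle_latent = torch.randn(1, 16, 8, 8)

# Concatenate along the channel dimension before the projection.
model_input = torch.cat([noisy_latent, subtitle_latent], dim=1)
tokens = new_proj(model_input)
```

With zero-initialized new channels, the expanded layer initially reproduces the pretrained projection of the noisy latent, so fine-tuning would start from the original model's behavior.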

Intended use

  • Research: subtitle removal / video inpainting with diffusion.
  • Not for high-stakes or misleading content; users are responsible for compliance with law and platform policies.

How to use

  1. Download CogVideoX-2b from zai-org/CogVideoX-2b.
  2. Download cogvideox_2b_CLEAR_lora_checkpoint.pt from this repository and save it locally.
  3. Run inference with the provided script (example):
export MODEL_PATH="/path/to/CogVideoX-2b"
export CHECKPOINT="/path/to/cogvideox_2b_CLEAR_lora_checkpoint.pt"

bash scripts/inference_cogvideox_2b.sh \
  --input_video /path/to/video_with_subtitles.mp4 \
  --output_dir ./output
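
For batch jobs, the same call can be driven from Python. The sketch below mirrors the shell example and uses only the standard library. `build_inference_call` and `remove_subtitles` are hypothetical helper names, and the sketch assumes the script reads `MODEL_PATH` and `CHECKPOINT` from the environment, as the exports above suggest.

```python
import os
import subprocess
from pathlib import Path

SCRIPT = "scripts/inference_cogvideox_2b.sh"  # script path from the example above


def build_inference_call(model_path, checkpoint, input_video, output_dir):
    """Return (argv, env) equivalent to the shell example."""
    # Copy the current environment and add the two variables the script expects.
    env = dict(os.environ, MODEL_PATH=str(model_path), CHECKPOINT=str(checkpoint))
    argv = [
        "bash", SCRIPT,
        "--input_video", str(input_video),
        "--output_dir", str(output_dir),
    ]
    return argv, env


def remove_subtitles(model_path, checkpoint, input_video, output_dir):
    """Check that the inputs exist, then run the inference script."""
    for path in (model_path, checkpoint, input_video):
        if not Path(path).exists():
            raise FileNotFoundError(f"Missing input: {path}")
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    argv, env = build_inference_call(model_path, checkpoint, input_video, output_dir)
    # check=True raises CalledProcessError if the script exits with a non-zero status.
    subprocess.run(argv, env=env, check=True)
```

Calling `remove_subtitles` once per input video, each with its own output directory, keeps the results of separate runs apart.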