CogVideoX-2b CLEAR LoRA — Subtitle Removal (Supplementary)

This repository releases LoRA weights, together with the expanded input-projection weights, for video-to-video subtitle removal on top of zai-org/CogVideoX-2b.

Disclaimer: This is a supplementary experiment from the CLEAR project. The main paper results use Wan2.1-Control; this CogVideoX-2b variant is not expected to match that baseline. It is shared for reproducibility and comparison.

Architecture change (high level)

CogVideoX-2b is originally a text-to-video model. To condition it on the subtitled input video, the input channels of the patch-embedding convolution (patch_embed.proj) are doubled from 16 to 32:

Before: patch_embed.proj: Conv2d(16 → 1920, …)
After: patch_embed.proj: Conv2d(32 → 1920, …)
- First 16 channels: noisy latent (inherits pretrained weights)
- Last 16 channels: subtitle-video latent (new channels, trained)
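
The channel expansion above can be sketched in PyTorch as follows. This is a minimal illustration, not the project's training code: the helper name `expand_input_channels` is hypothetical, the kernel size and stride of 2 are placeholders (the original listing elides them), and zero-initialising the new channels is an assumption, since the source only states that they are new and trained.

```python
import torch
import torch.nn as nn


def expand_input_channels(conv: nn.Conv2d, extra_in: int) -> nn.Conv2d:
    """Return a copy of `conv` that accepts `extra_in` additional input channels.

    Hypothetical helper; dilation and groups are left at their defaults.
    """
    expanded = nn.Conv2d(
        conv.in_channels + extra_in,
        conv.out_channels,
        kernel_size=conv.kernel_size,
        stride=conv.stride,
        padding=conv.padding,
        bias=conv.bias is not None,
    )
    with torch.no_grad():
        # Assumption: new channels start at zero, so the expanded layer
        # initially reproduces the pretrained layer's output.
        expanded.weight.zero_()
        # The first `in_channels` slices inherit the pretrained weights.
        expanded.weight[:, : conv.in_channels] = conv.weight
        if conv.bias is not None:
            expanded.bias.copy_(conv.bias)
    return expanded


# Placeholder kernel size and stride; only 16 -> 32 and 1920 come from the text above.
torch.manual_seed(0)
pretrained = nn.Conv2d(16, 1920, kernel_size=2, stride=2)
expanded = expand_input_channels(pretrained, extra_in=16)

noisy = torch.randn(1, 16, 8, 8)      # stands in for the noisy latent
condition = torch.randn(1, 16, 8, 8)  # stands in for the subtitle-video latent
out_old = pretrained(noisy)
out_new = expanded(torch.cat([noisy, condition], dim=1))
```

Under the zero-initialisation assumption, `out_new` matches `out_old` up to floating-point tolerance, so fine-tuning starts from the pretrained behaviour and learns to use the conditioning channels gradually.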

At inference, the noisy latent and the subtitle-video latent are concatenated along the channel dimension before entering the transformer, matching the training setup.
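
As a rough sketch of this concatenation step, assuming both latents share the same shape: the function name `build_transformer_input` and the tensor layout used in the example are hypothetical, and the channel axis is exposed as a parameter because the actual layout depends on the pipeline code.

```python
import torch


def build_transformer_input(
    noisy_latent: torch.Tensor,
    subtitle_latent: torch.Tensor,
    channel_dim: int = 1,
) -> torch.Tensor:
    """Stack the noisy latent and the conditioning latent along the channel axis."""
    if noisy_latent.shape != subtitle_latent.shape:
        raise ValueError(
            f"latent shapes differ: {tuple(noisy_latent.shape)} vs "
            f"{tuple(subtitle_latent.shape)}"
        )
    # Noisy latent first, conditioning latent second, matching the
    # channel order described for the expanded projection.
    return torch.cat([noisy_latent, subtitle_latent], dim=channel_dim)


# Hypothetical layout: [batch, channels, frames, height, width].
noisy = torch.randn(1, 16, 4, 8, 8)
subtitle = torch.randn(1, 16, 4, 8, 8)
x = build_transformer_input(noisy, subtitle)
```

The order matters: the pretrained weights occupy the first 16 input channels, so swapping the two latents would feed the conditioning signal into the pretrained half of the projection.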

Intended use

Research on subtitle removal and video inpainting with diffusion models.
Not intended for high-stakes applications or for producing misleading content; users are responsible for compliance with applicable law and platform policies.

How to use

1. Download CogVideoX-2b from zai-org/CogVideoX-2b.
2. Place cogvideox_2b_CLEAR_lora_checkpoint.pt locally.
3. Run inference with the provided script (example):

export MODEL_PATH="/path/to/CogVideoX-2b"
export CHECKPOINT="/path/to/cogvideox_2b_CLEAR_lora_checkpoint.pt"

bash scripts/inference_cogvideox_2b.sh \
  --input_video /path/to/video_with_subtitles.mp4 \
  --output_dir ./output
