CogVideoX-2b CLEAR LoRA: Subtitle Removal (Supplementary)
This repository releases LoRA + expanded input-projection weights for video-to-video subtitle removal on top of zai-org/CogVideoX-2b.
Disclaimer: This is a supplementary experiment from the CLEAR project. The main paper results use Wan2.1-Control; this CogVideoX-2b variant is not expected to match that baseline. It is shared for reproducibility and comparison.
Architecture change (high level)
CogVideoX-2b is originally a text-to-video model. To condition it on the subtitled video, the input channels of the patch-embedding convolution are expanded:
- Before: patch_embed.proj: Conv2d(16 → 1920, …)
- After: patch_embed.proj: Conv2d(32 → 1920, …)
  - First 16 channels: noisy latent (inherits pretrained weights)
  - Last 16 channels: subtitle-video latent (new channels, trained)
Inference concatenates noisy latent and subtitle latent along the channel dimension before the transformer, consistent with training.
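The expansion and the channel-wise concatenation can be illustrated with a minimal PyTorch sketch. It is not the repository's code: the helper `expand_input_conv` is hypothetical, the kernel size and stride of 2 are illustrative, and zero-initialising the new channels is an assumption (the card only says they are new and trained).

```python
import torch
import torch.nn as nn


def expand_input_conv(conv: nn.Conv2d, extra_in_channels: int) -> nn.Conv2d:
    """Return a copy of `conv` that accepts `extra_in_channels` more input channels.

    The original channels keep their pretrained weights. The new channels are
    zero-initialised, which is an assumption made for this sketch.
    """
    expanded = nn.Conv2d(
        conv.in_channels + extra_in_channels,
        conv.out_channels,
        kernel_size=conv.kernel_size,
        stride=conv.stride,
        padding=conv.padding,
        bias=conv.bias is not None,
    )
    with torch.no_grad():
        expanded.weight.zero_()
        # Copy pretrained weights into the first `conv.in_channels` input channels.
        expanded.weight[:, : conv.in_channels] = conv.weight
        if conv.bias is not None:
            expanded.bias.copy_(conv.bias)
    return expanded


# Stand-in for the pretrained patch_embed.proj (kernel/stride are illustrative).
pretrained = nn.Conv2d(16, 1920, kernel_size=2, stride=2)
proj = expand_input_conv(pretrained, extra_in_channels=16)

# Toy latents shaped (batch * frames, channels, height, width).
noisy_latent = torch.randn(2, 16, 8, 8)
subtitle_latent = torch.randn(2, 16, 8, 8)

# Channel-wise concatenation: noisy latent first, subtitle latent last.
model_input = torch.cat([noisy_latent, subtitle_latent], dim=1)
tokens = proj(model_input)
print(model_input.shape, tokens.shape)
```

With zero-initialised new channels, the expanded layer initially reproduces the pretrained projection of the noisy latent, so training starts from the base model's behaviour and learns to use the subtitle latent gradually.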
Intended use
- Research: subtitle removal / video inpainting with diffusion.
- Not for high-stakes or misleading content; users are responsible for compliance with law and platform policies.
How to use
- Download CogVideoX-2b from zai-org/CogVideoX-2b.
- Place cogvideox_2b_CLEAR_lora_checkpoint.pt locally.
- Run inference with the provided script (example):
export MODEL_PATH="/path/to/CogVideoX-2b"
export CHECKPOINT="/path/to/cogvideox_2b_CLEAR_lora_checkpoint.pt"
bash scripts/inference_cogvideox_2b.sh \
--input_video /path/to/video_with_subtitles.mp4 \
--output_dir ./output
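For scripted setups, the download step and the two environment variables can be prepared from Python. This is a rough sketch, not part of the repository: `download_base_model` and `inference_env` are hypothetical helper names, and it assumes the `huggingface_hub` package is installed when the download is actually run.

```python
import os


def download_base_model(local_dir: str) -> str:
    """Download zai-org/CogVideoX-2b into `local_dir` and return the local path."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id="zai-org/CogVideoX-2b", local_dir=local_dir)


def inference_env(model_path: str, checkpoint_path: str) -> dict:
    """Return a copy of the current environment with MODEL_PATH and CHECKPOINT set."""
    env = dict(os.environ)
    env["MODEL_PATH"] = model_path
    env["CHECKPOINT"] = checkpoint_path
    return env
```

The returned dictionary can be passed as `env=` to `subprocess.run` when launching the example command above.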