CLEAR: Context-Aware Learning with End-to-End Mask-Free Inference for Adaptive Video Subtitle Removal

📄 Paper | 💻 Code | 🤗 Model


Overview

CLEAR is a mask-free video subtitle removal framework built on top of Wan2.1. It achieves end-to-end inference through a two-stage, context-aware adaptive learning design, requiring no external text detection or segmentation modules at inference time.

  • 🎯 Mask-Free: No OCR, no segmentation β€” just input a video and get clean output
  • πŸš€ Parameter Efficient: Only 0.77% trainable parameters via LoRA (rank=64)
  • 🌍 Zero-Shot Cross-Lingual: Trained on Chinese subtitles, generalizes to English, Japanese, Korean, French, German, Russian, and Arabic
  • πŸ“ˆ State-of-the-Art: +6.77 dB PSNR and βˆ’74.7% VFID over best mask-dependent baselines

Performance

Evaluated on the Chinese subtitle test set (rank=64, steps=5, cfg=1.0, lora_scale=1.0):

| Metric | CLEAR (Ours) |
|---|---|
| PSNR ↑ | 26.80 |
| SSIM ↑ | 0.894 |
| LPIPS ↓ | 0.101 |
| DISTS ↓ | 0.075 |
| VFID ↓ | 20.37 |
| TWE ↓ | 1.227 |
| TC ↓ | 1.049 |
| Time (s/frame) | 4.86 |

Usage

Requirements

```shell
# 1. Install Wan2.1
git clone https://github.com/Wan-Video/Wan2.1.git && cd Wan2.1
pip install -r requirements.txt
cd ..

# 2. Install DiffSynth-Studio
git clone https://github.com/modelscope/DiffSynth-Studio.git && cd DiffSynth-Studio
pip install -e .
cd ..

# 3. Download Wan2.1 base weights
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./models/Wan2.1-T2V-1.3B

# 4. Install CLEAR dependencies (run from the CLEAR repository root)
pip install -r requirements.txt
```

Inference (Mask-Free)

```shell
python inference.py \
    --model_base_path /path/to/Wan2.1-T2V-1.3B \
    --lora_checkpoint ./checkpoints/CLEAR-mask-free-subtitle-removal.pt \
    --lora_rank 64 \
    --lora_scale 1.0 \
    --input_video input_video.mp4 \
    --output_dir ./results \
    --num_steps 5 \
    --cfg_scale 1.0 \
    --use_sliding_window \
    --create_comparison
```

Key Inference Parameters

| Parameter | Default | Description |
|---|---|---|
| `num_steps` | 5 | Denoising steps |
| `cfg_scale` | 1.0 | Classifier-free guidance scale |
| `lora_scale` | 1.0 | LoRA strength (0.0–2.0) |
| `chunk_size` | 81 | Sliding window size (frames) |
| `chunk_overlap` | 16 | Overlap between chunks (frames) |
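For videos longer than `chunk_size`, the sliding window splits the frame sequence into overlapping chunks. The helper below is an illustrative sketch of one plausible chunking scheme, not the actual scheduling logic in `inference.py`; the final window is simply flushed to the end of the video, so the last overlap can exceed `chunk_overlap`.

```python
def plan_chunks(num_frames: int, chunk_size: int = 81, overlap: int = 16):
    """Plan (start, end) frame ranges for sliding-window inference.

    Illustrative sketch only: the real inference.py scheduler may
    handle boundaries and blending differently.
    """
    if num_frames <= chunk_size:
        return [(0, num_frames)]  # short video: one chunk covers everything
    stride = chunk_size - overlap
    chunks = []
    start = 0
    while start + chunk_size < num_frames:
        chunks.append((start, start + chunk_size))
        start += stride
    # Flush the final window to the end so no frame is left uncovered.
    chunks.append((num_frames - chunk_size, num_frames))
    return chunks

print(plan_chunks(200))  # [(0, 81), (65, 146), (119, 200)]
```

Overlapping frames would typically be blended (e.g. linearly cross-faded) to avoid visible seams between chunks.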

Method

CLEAR uses a two-stage training pipeline with mask-free inference:

**Stage I — Self-Supervised Prior Learning.** Dual ResNet-50 encoders extract disentangled occlusion features from paired (clean, subtitled) video frames using orthogonality constraints and focal-weighted pseudo-labels — no manual annotation needed.
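One way to read the orthogonality constraint is as a penalty on the cosine similarity between the clean-content and occlusion feature vectors. The toy scalar version below (function name and exact loss form are assumptions, not the paper's definition) illustrates the idea:

```python
import math

def orthogonality_penalty(f_content, f_occlusion):
    # Squared cosine similarity between two feature vectors:
    # 0 when they are orthogonal (fully disentangled),
    # 1 when they are parallel (fully entangled).
    dot = sum(x * y for x, y in zip(f_content, f_occlusion))
    norm = math.sqrt(sum(x * x for x in f_content)) * \
           math.sqrt(sum(y * y for y in f_occlusion))
    return (dot / norm) ** 2

print(orthogonality_penalty([1.0, 0.0], [0.0, 1.0]))  # 0.0: orthogonal
print(orthogonality_penalty([1.0, 0.0], [2.0, 0.0]))  # 1.0: parallel
```

Minimizing such a penalty pushes the two encoders toward representing complementary information: scene content in one, subtitle occlusion in the other.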

**Stage II — Adaptive Weighting Learning.** A lightweight occlusion head (~2.1M params) and LoRA adaptation are jointly trained on the frozen Wan2.1 backbone. Dynamic alpha scheduling prevents over-reliance on noisy Stage I priors.
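A natural reading of "dynamic alpha scheduling" is a weight on the Stage I prior that ramps down as training progresses, so the model gradually stops leaning on the noisy prior. The cosine ramp below is a hypothetical schedule for illustration; the actual CLEAR schedule is not specified here.

```python
import math

def alpha_schedule(step: int, total_steps: int, alpha_max: float = 1.0) -> float:
    """Hypothetical cosine ramp-down of the Stage I prior weight."""
    t = min(max(step / total_steps, 0.0), 1.0)  # normalized training progress
    return alpha_max * 0.5 * (1.0 + math.cos(math.pi * t))

print(alpha_schedule(0, 1000))     # 1.0: rely fully on the prior early on
print(alpha_schedule(1000, 1000))  # ~0.0: prior phased out by the end
```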

**Inference — End-to-End, Mask-Free.** The Stage I encoder is not needed at inference time. Adaptive weighting is fully internalized into the LoRA-augmented attention, enabling single-pass DDIM generation from a subtitled video alone.


Ablation

| Configuration | PSNR ↑ | SSIM ↑ | LPIPS ↓ | VFID ↓ |
|---|---|---|---|---|
| Baseline (LoRA-only) | 21.62 | 0.855 | 0.131 | 34.74 |
| + Stage I Prior + Focal Weighting | 23.11 | 0.868 | 0.130 | 38.21 |
| + Context Distillation | 24.72 | 0.890 | 0.110 | 31.73 |
| + Context-Aware Adaptation | 25.09 | 0.891 | 0.109 | 31.56 |
| + Context Consistency (CLEAR) | 26.80 | 0.894 | 0.101 | 20.37 |

Acknowledgements

Built upon Wan2.1, DiffSynth-Studio, and PEFT.

Disclaimer

This model is released for research purposes only. We explicitly prohibit use for generating misinformation or any malicious application. Please ensure compliance with applicable laws and the upstream Wan2.1 Community License.
