# CLEAR: Context-Aware Learning with End-to-End Mask-Free Inference for Adaptive Video Subtitle Removal

📄 Paper | 💻 Code | 🤗 Model
## Overview

CLEAR is a mask-free video subtitle removal framework built on top of Wan2.1. It achieves end-to-end inference through a two-stage context-aware adaptive learning design, requiring no external text detection or segmentation modules at inference time.
- **Mask-Free**: no OCR, no segmentation; just input a video and get a clean output
- **Parameter-Efficient**: only 0.77% trainable parameters via LoRA (rank=64)
- **Zero-Shot Cross-Lingual**: trained on Chinese subtitles, generalizes to English, Japanese, Korean, French, German, Russian, and Arabic
- **State-of-the-Art**: +6.77 dB PSNR and −74.7% VFID versus the best mask-dependent baselines
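The 0.77% figure can be sanity-checked with LoRA's parameter arithmetic: a rank-r adapter on a d_out × d_in weight adds r · (d_in + d_out) trainable parameters while the base weight stays frozen. A minimal sketch of that arithmetic (the 1536-wide projection below is illustrative, not necessarily Wan2.1's actual layer shape):

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    # A LoRA pair (A: rank x d_in, B: d_out x rank) adds rank * (d_in + d_out)
    # trainable parameters on top of the frozen d_out x d_in base weight.
    return rank * (d_in + d_out)

# Illustrative square attention projection adapted at rank 64.
d = 1536
base_params = d * d
extra = lora_params(d, d, 64)
print(f"per-layer trainable fraction: {extra / (base_params + extra):.2%}")
```

Note that the whole-model trainable fraction is far smaller than this per-layer figure, since adapters are attached to only a subset of layers.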
## Performance

Evaluated on the Chinese subtitle test set (rank=64, steps=5, cfg=1.0, lora_scale=1.0):
| Metric | CLEAR (Ours) |
|---|---|
| PSNR ↑ | 26.80 |
| SSIM ↑ | 0.894 |
| LPIPS ↓ | 0.101 |
| DISTS ↓ | 0.075 |
| VFID ↓ | 20.37 |
| TWE ↓ | 1.227 |
| TC ↓ | 1.049 |
| Time (s/frame) | 4.86 |
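For reference, PSNR in tables like this follows the standard definition PSNR = 10 · log10(MAX² / MSE), computed between the restored and ground-truth frames. A minimal sketch over uint8 frames (not the exact evaluation script used here):

```python
import numpy as np

def psnr(clean: np.ndarray, restored: np.ndarray, max_val: float = 255.0) -> float:
    # PSNR = 10 * log10(MAX^2 / MSE), with MSE averaged over all pixels/channels.
    mse = np.mean((clean.astype(np.float64) - restored.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)

frame_a = np.full((4, 4, 3), 100, dtype=np.uint8)
frame_b = frame_a.copy()
frame_b[0, 0, 0] = 110  # small perturbation
print(round(psnr(frame_a, frame_b), 2))
```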
## Usage

### Requirements
```bash
# 1. Install Wan2.1
git clone https://github.com/Wan-Video/Wan2.1.git && cd Wan2.1
pip install -r requirements.txt
cd ..

# 2. Install DiffSynth-Studio
git clone https://github.com/modelscope/DiffSynth-Studio.git && cd DiffSynth-Studio
pip install -e .
cd ..

# 3. Download the Wan2.1 base weights
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./models/Wan2.1-T2V-1.3B

# 4. Install CLEAR dependencies (run from the CLEAR repository root)
pip install -r requirements.txt
```
### Inference (Mask-Free)

```bash
python inference.py \
    --model_base_path /path/to/Wan2.1-T2V-1.3B \
    --lora_checkpoint ./checkpoints/CLEAR-mask-free-subtitle-removal.pt \
    --lora_rank 64 \
    --lora_scale 1.0 \
    --input_video input_video.mp4 \
    --output_dir ./results \
    --num_steps 5 \
    --cfg_scale 1.0 \
    --use_sliding_window \
    --create_comparison
```
### Key Inference Parameters

| Parameter | Default | Description |
|---|---|---|
| `num_steps` | 5 | Denoising steps |
| `cfg_scale` | 1.0 | Classifier-free guidance scale |
| `lora_scale` | 1.0 | LoRA strength (0.0–2.0) |
| `chunk_size` | 81 | Sliding window size (frames) |
| `chunk_overlap` | 16 | Overlap between chunks (frames) |
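The defaults above imply a stride of 81 − 16 = 65 frames between consecutive windows. A hedged sketch of how the sliding-window start indices might be laid out (the exact chunking scheme in `inference.py` may differ):

```python
def chunk_starts(num_frames: int, chunk_size: int = 81, chunk_overlap: int = 16) -> list[int]:
    # Consecutive chunks share `chunk_overlap` frames, so the stride is size - overlap.
    stride = chunk_size - chunk_overlap
    last = max(num_frames - chunk_size, 0)
    starts = list(range(0, last + 1, stride))
    # Add a final end-aligned chunk if the stride does not land exactly on it.
    if starts[-1] != last:
        starts.append(last)
    return starts

print(chunk_starts(200))  # start frames of each 81-frame window for a 200-frame video
```

End-aligning the last window keeps every frame covered without padding the video; the overlapping frames are typically blended when chunks are stitched back together.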
## Method

CLEAR uses a two-stage training pipeline with mask-free inference:

**Stage I: Self-Supervised Prior Learning.** Dual ResNet-50 encoders extract disentangled occlusion features from paired (clean, subtitled) video frames using orthogonality constraints and focal-weighted pseudo-labels; no manual annotation is needed.

**Stage II: Adaptive Weighting Learning.** A lightweight occlusion head (~2.1M parameters) and a LoRA adapter are jointly trained on the frozen Wan2.1 backbone. Dynamic alpha scheduling prevents over-reliance on noisy Stage I priors.

**Inference: End-to-End, Mask-Free.** The Stage I encoder is not needed at inference time. Adaptive weighting is fully internalized into the LoRA-augmented attention, enabling single-pass DDIM generation from the subtitled video alone.
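As one illustration of the Stage I orthogonality constraint, the sketch below penalizes the cross-covariance between the two encoders' feature streams so that content and occlusion features decorrelate. This is a generic formulation assumed for illustration, not necessarily the exact loss used by CLEAR:

```python
import numpy as np

def orthogonality_loss(f_content: np.ndarray, f_occlusion: np.ndarray) -> float:
    # L2-normalize each feature vector (rows = samples), then penalize the
    # squared Frobenius norm of the cross-covariance between the two streams.
    a = f_content / (np.linalg.norm(f_content, axis=1, keepdims=True) + 1e-8)
    b = f_occlusion / (np.linalg.norm(f_occlusion, axis=1, keepdims=True) + 1e-8)
    cross = a.T @ b / a.shape[0]
    return float(np.sum(cross ** 2))

rng = np.random.default_rng(0)
content = rng.standard_normal((32, 128))    # hypothetical content features
occlusion = rng.standard_normal((32, 128))  # hypothetical occlusion features
print(orthogonality_loss(content, occlusion))
```

Identical feature streams yield a large penalty, while independent streams score near zero, which pushes the two encoders toward disentangled representations.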
## Ablation

| Configuration | PSNR ↑ | SSIM ↑ | LPIPS ↓ | VFID ↓ |
|---|---|---|---|---|
| Baseline (LoRA-only) | 21.62 | 0.855 | 0.131 | 34.74 |
| + Stage I Prior + Focal Weighting | 23.11 | 0.868 | 0.130 | 38.21 |
| + Context Distillation | 24.72 | 0.890 | 0.110 | 31.73 |
| + Context-Aware Adaptation | 25.09 | 0.891 | 0.109 | 31.56 |
| + Context Consistency (CLEAR) | 26.80 | 0.894 | 0.101 | 20.37 |
## Contact

- Qingdong He: heqingdong@alu.uestc.edu.cn
- Chaoyi Wang: chaoyiwang@mail.sim.ac.cn
## Acknowledgements
Built upon Wan2.1, DiffSynth-Studio, and PEFT.
## Disclaimer
This model is released for research purposes only. We explicitly prohibit use for generating misinformation or any malicious application. Please ensure compliance with applicable laws and the upstream Wan2.1 Community License.