CLEAR: Context-Aware Learning with End-to-End Mask-Free Inference for Adaptive Video Subtitle Removal

📄 Paper | 💻 Code | 🤗 Model


Overview

CLEAR is a mask-free video subtitle removal framework built on top of Wan2.1. It achieves end-to-end inference through a two-stage, context-aware adaptive learning design, requiring no external text detection or segmentation modules at inference time.

  • 🎯 Mask-Free: No OCR, no segmentation β€” just input a video and get clean output
  • πŸš€ Parameter Efficient: Only 0.77% trainable parameters via LoRA (rank=64)
  • 🌍 Zero-Shot Cross-Lingual: Trained on Chinese subtitles, generalizes to English, Japanese, Korean, French, German, Russian, and Arabic
  • πŸ“ˆ State-of-the-Art: +6.77 dB PSNR and βˆ’74.7% VFID over best mask-dependent baselines

Performance

Evaluated on the Chinese subtitle test set (rank=64, steps=5, cfg=1.0, lora_scale=1.0):

| Metric | CLEAR (Ours) |
|---|---|
| PSNR ↑ | 26.80 |
| SSIM ↑ | 0.894 |
| LPIPS ↓ | 0.101 |
| DISTS ↓ | 0.075 |
| VFID ↓ | 20.37 |
| TWE ↓ | 1.227 |
| TC ↓ | 1.049 |
| Time (s/frame) | 4.86 |

Usage

Requirements

```shell
# 1. Install Wan2.1
git clone https://github.com/Wan-Video/Wan2.1.git && cd Wan2.1
pip install -r requirements.txt
cd ..

# 2. Install DiffSynth-Studio
git clone https://github.com/modelscope/DiffSynth-Studio.git && cd DiffSynth-Studio
pip install -e .
cd ..

# 3. Download Wan2.1 base weights
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./models/Wan2.1-T2V-1.3B

# 4. Install CLEAR dependencies (run from the CLEAR repository root)
pip install -r requirements.txt
```

Inference (Mask-Free)

```shell
python inference.py \
    --model_base_path /path/to/Wan2.1-T2V-1.3B \
    --lora_checkpoint ./checkpoints/CLEAR-mask-free-subtitle-removal.pt \
    --lora_rank 64 \
    --lora_scale 1.0 \
    --input_video input_video.mp4 \
    --output_dir ./results \
    --num_steps 5 \
    --cfg_scale 1.0 \
    --use_sliding_window \
    --create_comparison
```

Key Inference Parameters

| Parameter | Default | Description |
|---|---|---|
| `num_steps` | 5 | Denoising steps |
| `cfg_scale` | 1.0 | Classifier-free guidance scale |
| `lora_scale` | 1.0 | LoRA strength (0.0–2.0) |
| `chunk_size` | 81 | Sliding window size (frames) |
| `chunk_overlap` | 16 | Overlap between chunks (frames) |
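For videos longer than `chunk_size`, the sliding window splits the frame sequence into overlapping chunks. The helper below is an illustrative sketch of one plausible chunking scheme, not the actual scheduling logic in `inference.py`; the final window is simply flushed to the end of the video, so the last overlap can exceed `chunk_overlap`.

```python
def plan_chunks(num_frames: int, chunk_size: int = 81, overlap: int = 16):
    """Plan (start, end) frame ranges for sliding-window inference.

    Illustrative sketch only: the real inference.py scheduler may
    handle boundaries and blending differently.
    """
    if num_frames <= chunk_size:
        return [(0, num_frames)]  # short video: one chunk covers everything
    stride = chunk_size - overlap
    chunks = []
    start = 0
    while start + chunk_size < num_frames:
        chunks.append((start, start + chunk_size))
        start += stride
    # Flush the final window to the end so no frame is left uncovered.
    chunks.append((num_frames - chunk_size, num_frames))
    return chunks

print(plan_chunks(200))  # [(0, 81), (65, 146), (119, 200)]
```

Overlapping frames would typically be blended (e.g. linearly cross-faded) to avoid visible seams between chunks.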

Method

CLEAR uses a two-stage training pipeline with mask-free inference:

**Stage I — Self-Supervised Prior Learning.** Dual ResNet-50 encoders extract disentangled occlusion features from paired (clean, subtitled) video frames using orthogonality constraints and focal-weighted pseudo-labels — no manual annotation needed.
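One way to read the orthogonality constraint is as a penalty on the cosine similarity between the clean-content and occlusion feature vectors. The toy scalar version below (function name and exact loss form are assumptions, not the paper's definition) illustrates the idea:

```python
import math

def orthogonality_penalty(f_content, f_occlusion):
    # Squared cosine similarity between two feature vectors:
    # 0 when they are orthogonal (fully disentangled),
    # 1 when they are parallel (fully entangled).
    dot = sum(x * y for x, y in zip(f_content, f_occlusion))
    norm = math.sqrt(sum(x * x for x in f_content)) * \
           math.sqrt(sum(y * y for y in f_occlusion))
    return (dot / norm) ** 2

print(orthogonality_penalty([1.0, 0.0], [0.0, 1.0]))  # 0.0: orthogonal
print(orthogonality_penalty([1.0, 0.0], [2.0, 0.0]))  # 1.0: parallel
```

Minimizing such a penalty pushes the two encoders toward representing complementary information: scene content in one, subtitle occlusion in the other.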

**Stage II — Adaptive Weighting Learning.** A lightweight occlusion head (~2.1M params) and LoRA adaptation are jointly trained on the frozen Wan2.1 backbone. Dynamic alpha scheduling prevents over-reliance on noisy Stage I priors.
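A natural reading of "dynamic alpha scheduling" is a weight on the Stage I prior that ramps down as training progresses, so the model gradually stops leaning on the noisy prior. The cosine ramp below is a hypothetical schedule for illustration; the actual CLEAR schedule is not specified here.

```python
import math

def alpha_schedule(step: int, total_steps: int, alpha_max: float = 1.0) -> float:
    """Hypothetical cosine ramp-down of the Stage I prior weight."""
    t = min(max(step / total_steps, 0.0), 1.0)  # normalized training progress
    return alpha_max * 0.5 * (1.0 + math.cos(math.pi * t))

print(alpha_schedule(0, 1000))     # 1.0: rely fully on the prior early on
print(alpha_schedule(1000, 1000))  # ~0.0: prior phased out by the end
```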

**Inference — End-to-End, Mask-Free.** The Stage I encoder is not needed at inference time. Adaptive weighting is fully internalized into the LoRA-augmented attention, enabling single-pass DDIM generation from a subtitled video alone.


Ablation

| Configuration | PSNR ↑ | SSIM ↑ | LPIPS ↓ | VFID ↓ |
|---|---|---|---|---|
| Baseline (LoRA-only) | 21.62 | 0.855 | 0.131 | 34.74 |
| + Stage I Prior + Focal Weighting | 23.11 | 0.868 | 0.130 | 38.21 |
| + Context Distillation | 24.72 | 0.890 | 0.110 | 31.73 |
| + Context-Aware Adaptation | 25.09 | 0.891 | 0.109 | 31.56 |
| + Context Consistency (CLEAR) | 26.80 | 0.894 | 0.101 | 20.37 |

Acknowledgements

Built upon Wan2.1, DiffSynth-Studio, and PEFT.

Disclaimer

This model is released for research purposes only. We explicitly prohibit use for generating misinformation or any malicious application. Please ensure compliance with applicable laws and the upstream Wan2.1 Community License.
