# VisMemVideo NoVis SFT Checkpoint

**Model:** Qwen2.5-VL-7B-Instruct + VisMemVideo (QueryBuilder + CrossAttentionFusion + LoRA Memory Former)
## Architecture
- Reasoner: Frozen Qwen2.5-VL-7B-Instruct (provides visual context H)
- Memory Former: Separate Qwen2.5-VL-7B + LoRA adapter (trainable)
- Video Memory Bank: QwenVideoMemoryBank (reuses reasoner ViT, 3584-dim)
- QueryBuilder: 1-layer transformer encoder (8 heads, ff_mult=2)
- CrossAttentionFusion: 2-layer cross-attention (d_inner=512, 8 heads)
- Training mode: NoVis SFT (text-only input with memory augmentation, visual info only through memory tokens)
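The fusion path above can be sketched in PyTorch. This is a minimal illustration assuming the dimensions listed (3584-dim hidden states, 8 query tokens, `d_inner=512`); the real `QueryBuilder` and `CrossAttentionFusion` classes live in `open_r1.vismem_video_model`, and their internals may differ.

```python
import torch
import torch.nn as nn

D = 3584        # Qwen2.5-VL-7B hidden size
QUERY_LEN = 8   # learned query tokens (query_len = mem_len = 8)
D_INNER = 512   # cross-attention inner dimension

class QueryBuilder(nn.Module):
    """1-layer transformer encoder over learned query tokens (8 heads, ff_mult=2)."""
    def __init__(self):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(QUERY_LEN, D) * 0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=D, nhead=8, dim_feedforward=2 * D, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, batch_size: int) -> torch.Tensor:
        q = self.queries.unsqueeze(0).expand(batch_size, -1, -1)
        return self.encoder(q)  # (B, 8, 3584)

class CrossAttentionFusion(nn.Module):
    """2-layer cross-attention: query tokens attend to the video memory bank."""
    def __init__(self):
        super().__init__()
        self.down = nn.Linear(D, D_INNER)
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(D_INNER, num_heads=8, batch_first=True)
            for _ in range(2))
        self.up = nn.Linear(D_INNER, D)

    def forward(self, queries, memory):
        q, kv = self.down(queries), self.down(memory)
        for attn in self.layers:
            q = q + attn(q, kv, kv)[0]  # residual cross-attention
        return self.up(q)  # memory tokens fed to the frozen reasoner

qb, fusion = QueryBuilder(), CrossAttentionFusion()
memory = torch.randn(1, 64, D)       # e.g. 64 ViT features from the memory bank
mem_tokens = fusion(qb(1), memory)   # shape (1, 8, 3584)
```

In NoVis SFT these 8 memory tokens are the only channel through which visual information reaches the frozen reasoner.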
## Training Config
| Parameter | Value |
|---|---|
| Dataset | Video-R1-COT-10k-video-only-4k (4000 samples) |
| Epochs | 2 |
| Effective batch size | 16 (4 GPUs × batch size 2 × grad accum 2) |
| Learning rate | 1e-5 (custom params: 5e-5) |
| Scheduler | Cosine with 3% warmup |
| invoke_prob | 0.5 |
| query_len / mem_len | 8 / 8 |
| LoRA | r=16, alpha=32, targets=q,k,v,o_proj |
| Trainable params | 134M (QB: 110M, CA: 14M, LoRA: 10M) |
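The LoRA row of the table maps directly onto a `peft` adapter configuration. A minimal sketch, assuming `peft`'s standard `LoraConfig` API; the dropout value and task type are illustrative, as the repo's training script is not shown here.

```python
from peft import LoraConfig

# LoRA setup matching the table above: r=16, alpha=32,
# applied to the attention projections q/k/v/o_proj.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.0,     # assumption: not stated in the table
    task_type="CAUSAL_LM",
)
```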
## Training Results
| Metric | Value |
|---|---|
| Final loss | 0.912 |
| Average loss | 0.972 |
| Ref loss (no memory) | ~1.14 |
| Final mem_lift | +0.19 |
| Training time | 2 h 8 min (4× GPU) |
## Checkpoints

- `epoch1/`: checkpoint at step 250 (end of epoch 1)
- `epoch2/`: checkpoint at step 500 (end of epoch 2, final)
## Loading

```python
from open_r1.vismem_video_model import load_vismem_video_checkpoint

# Point to the epoch1/ or epoch2/ directory
load_vismem_video_checkpoint(model, "path/to/epoch2")
```
Or via CLI argument:

```
--latent_checkpoint path/to/epoch2
```