Wiki Screenshot Embedding LoRA Checkpoints
LoRA / DoRA adapter checkpoints for Qwen3-VL-Embedding-2B, fine-tuned on Wikipedia screenshot tiles for visual document retrieval.
All evals on mini-v8 (400 queries, 7426 tiles).
Headline (Cross-reader / Cross-thinking comparison @ best ckpts)
| ckpt | R@1 | R@3 | Qwen3-VL-4B no-think mt=200 | Qwen3.5-4B no-think mt=200 | Qwen3.5-4B think mt=8192 |
|---|---|---|---|---|---|
| Base (no LoRA) | 0.6875 | 0.830 | – | 0.7300 | 0.7925 |
| hyper3/ckpt250 (LLM-only LoRA) | 0.7000 | 0.840 | 0.7775 | 0.7700 | 0.8225 |
| lora_vit/ckpt200 | 0.7575 | 0.8900 | 0.7875 | 0.7850 | 0.8375 |
| lora_vit/ckpt250 | 0.7625 | 0.8925 | 0.7800 | – | 0.8375 |
| lora_vit/ckpt300 | 0.7625 | 0.8925 | 0.7825 | – | – |
| dora_ls005/ckpt150 ← | 0.7575 | 0.8825 | 0.7875 | 0.7875 | 0.8525 |
| dora_ls005/ckpt225 | 0.7675 | 0.8825 | 0.7800 | 0.7725 | 0.8450 |
| dora_ls005/ckpt250 | 0.7575 | 0.8850 | 0.7750 | 0.7750 | 0.8325 |
Best per metric:
- R@1: dora_ls005/ckpt225 (0.7675)
- R@3: lora_vit/ckpt250 and ckpt300 (0.8925)
- Qwen3-VL no-think QA: dora_ls005/ckpt150 and lora_vit/ckpt200, tied (0.7875)
- Qwen3.5 no-think QA: dora_ls005/ckpt150 (0.7875)
- Qwen3.5 think QA (mt=8192): dora_ls005/ckpt150 (0.8525) ← overall SOTA
dora_ls005/ckpt150 is the most well-rounded checkpoint: it wins all three QA metrics and trails the retrieval leaders only marginally (R@1 -0.01, R@3 -0.01).
Training Configs
dora_ls005 (NEW): best QA across all readers
v8r-style + DoRA + label smoothing 0.05 + tokens=4096
- Base model: Qwen/Qwen3-VL-Embedding-2B
- Data: natural_filtered_v2 (104k pairs, 2 hard negatives)
- LoRA: rank 32, alpha 32, DoRA enabled (use_dora: true); targets LLM (q/k/v/o_proj) + ViT (attn.qkv, attn.proj, mlp.linear_fc1/fc2)
- Loss: InfoNCE with label smoothing 0.05
- LR: 7e-6, cosine schedule, warmup 20 steps (image) + 50 steps (text)
- Batch size: 64 (single GPU; effective batch 64, smaller than lora_vit's 256)
- Max steps: 350, save every 25
- Visual tokens: 4096
- Text warmup: 50 steps of text-only training before image training
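The bullets above map onto a PEFT-style adapter config. A minimal sketch of that mapping as plain dicts, assuming standard `peft.LoraConfig` field names (the actual training harness is not included in this repo, so treat the exact keys as illustrative):

```python
# Hypothetical reconstruction of the dora_ls005 adapter settings as a
# peft.LoraConfig-style dict. Field names follow PEFT conventions; the
# real training config may differ in details.
dora_ls005_adapter = {
    "r": 32,                 # LoRA rank
    "lora_alpha": 32,        # scaling alpha (alpha / r = 1.0)
    "use_dora": True,        # DoRA: adds learnable magnitude vectors
    "target_modules": [
        # LLM attention projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        # ViT attention + MLP
        "attn.qkv", "attn.proj", "mlp.linear_fc1", "mlp.linear_fc2",
    ],
}

# Matching trainer-side hyperparameters from the bullet list above.
dora_ls005_train = {
    "learning_rate": 7e-6,
    "lr_schedule": "cosine",
    "batch_size": 64,
    "max_steps": 350,
    "save_every": 25,
    "label_smoothing": 0.05,
    "visual_tokens": 4096,
    "text_warmup_steps": 50,
    "image_warmup_steps": 20,
}
```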
Eval results (mini-v8, 400 queries / 7426 tiles)
| Step | R@1 | R@3 | Qwen3-VL-4B (no-think) | Qwen3.5-4B (no-think) | Qwen3.5-4B (think mt=8192) |
|---|---|---|---|---|---|
| 125 | 0.7425 | 0.8750 | – | 0.7800 | 0.8400 |
| 150 ← | 0.7575 | 0.8825 | 0.7875 | 0.7875 | 0.8525 |
| 175 | 0.7550 | 0.8750 | – | 0.7800 | 0.8375 |
| 200 | 0.7525 | 0.8750 | – | 0.7775 | 0.8325 |
| 225 | 0.7675 | 0.8825 | 0.7800 | 0.7725 | 0.8450 |
| 250 | 0.7575 | 0.8850 | 0.7750 | 0.7750 | 0.8325 |
Why thinking helps so much
The Qwen3.5-4B reader is a reasoning model. With enable_thinking=True and large max_tokens (β₯8192) the answer quality jumps substantially:
| Reader config | Base (no LoRA) | Best ckpt (dora_ls005/ckpt150) |
|---|---|---|
| Qwen3.5 no-think mt=200 | 0.7300 | 0.7875 (+5.75%) |
| Qwen3.5 think mt=8192 | 0.7925 (+6.25%) | 0.8525 (+12.25%) |
max_tokens sensitivity for dora_ls005/ckpt150 with thinking:
- mt=2048 → 0.8175
- mt=4096 → 0.8325
- mt=8192 → 0.8525 ← sweet spot
- mt=16384 → 0.8500 (plateau)

Use max_tokens ≥ 8192 when thinking is enabled; otherwise the reasoning trace gets truncated before the final answer.
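The two reader configurations can be expressed as generation-request payloads. This sketch assumes an OpenAI-compatible serving API (e.g. vLLM) where Qwen's thinking mode is toggled via `chat_template_kwargs`; the actual eval harness may invoke the reader differently:

```python
def reader_request(question: str, think: bool) -> dict:
    """Build a chat-completion payload for the Qwen3.5-4B reader.

    Hypothetical sketch: assumes an OpenAI-compatible server where
    Qwen's reasoning mode is toggled via chat_template_kwargs.
    """
    return {
        "model": "Qwen3.5-4B",
        "messages": [{"role": "user", "content": question}],
        # Thinking mode needs a large budget so the reasoning trace
        # is not truncated before the final answer is emitted.
        "max_tokens": 8192 if think else 200,
        "chat_template_kwargs": {"enable_thinking": think},
    }

fast = reader_request("Who founded the company?", think=False)  # cheap baseline
best = reader_request("Who founded the company?", think=True)   # best accuracy
```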
lora_vit (v8_r_warmup50_lr7e6_lora_vit_350): original best
- Base model: Qwen/Qwen3-VL-Embedding-2B
- Data: natural_filtered_v2 (104k pairs, 2 hard negatives)
- LoRA: rank 32, alpha 32; targets include ViT layers (attn.qkv, attn.proj, mlp.linear_fc1/fc2)
- LR: 7e-6, cosine schedule, warmup 50 steps
- Batch size: 256 effective
- Max steps: 350
- Visual tokens: 4096
Eval Results
| Step | v6 recall@1 | v6 recall@3 | v6 QA | v6 vs base | v8 recall@1 | v8 recall@3 | v8 QA | v8 vs base |
|---|---|---|---|---|---|---|---|---|
| base | 0.655 | 0.800 | 0.645 | – | 0.690 | 0.828 | 0.708 | – |
| 100 | 0.690 | 0.855 | 0.715 | +0.070 | 0.725 | 0.870 | 0.760 | +0.052 |
| 150 | 0.720 | 0.870 | 0.735 | +0.090 | 0.750 | 0.888 | 0.780 | +0.072 |
| 200 | 0.740 | 0.860 | 0.750 | +0.105 | 0.753 | 0.890 | 0.793 | +0.085 |
Best: ckpt200 β v8 QA = 0.793 (+8.5%), v6 QA = 0.750 (+10.5%)
hyper3 (v8_i_warmup50_lr7e6_hardswitch_350)
- Base model: Qwen/Qwen3-VL-Embedding-2B
- Data: natural_filtered_v2 (104k pairs, 2 hard negatives)
- LoRA: rank 32, alpha 32 (LLM layers only, no ViT)
- LR: 7e-6, cosine schedule, warmup 50 steps
- Batch size: 256 effective
- Max steps: 350 (hard switch)
- Visual tokens: 4096
Eval Results
| Step | v6 QA | v6 vs base | v8 QA | v8 vs base |
|---|---|---|---|---|
| base | 0.645 | – | 0.708 | – |
| 100 | 0.715 | +0.070 | 0.745 | +0.038 |
| 150 | 0.710 | +0.065 | 0.748 | +0.040 |
| 200 | 0.725 | +0.080 | 0.753 | +0.045 |
| 250 | 0.715 | +0.070 | 0.770 | +0.063 |
| 300 | 0.715 | +0.070 | 0.763 | +0.055 |
| 350 | 0.715 | +0.070 | 0.763 | +0.055 |
Best: ckpt250 (v8 QA = 0.770, +6.3% over base)
Usage
```python
from peft import PeftModel
from transformers import AutoModel

base = AutoModel.from_pretrained("Qwen/Qwen3-VL-Embedding-2B")

# Best overall (DoRA + LS=0.05): wins QA on every reader
model = PeftModel.from_pretrained(
    base,
    "Chrisyichuan/wiki-screenshot-embedding-lora",
    subfolder="dora_ls005/ckpt150",
)
```
DoRA is auto-detected from adapter_config.json (use_dora: true), so no extra code is needed: PEFT loads lora_A, lora_B, and lora_magnitude_vector automatically.
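Once the adapter is loaded, retrieval is a standard embed-and-rank loop over the tile index. A minimal sketch of the scoring step with NumPy (the toy embeddings below are placeholders for real model outputs; the embedding model exposes its own encode interface):

```python
import numpy as np

def top_k_tiles(query_emb: np.ndarray, tile_embs: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k tiles most similar to the query.

    Assumes embeddings are L2-normalized, so the dot product
    equals cosine similarity.
    """
    scores = tile_embs @ query_emb       # shape: (num_tiles,)
    return np.argsort(-scores)[:k]       # highest score first

# Toy example: 4 fake 3-d tile embeddings standing in for real outputs.
tiles = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.7, 0.7, 0.0],
                  [0.0, 0.0, 1.0]])
tiles /= np.linalg.norm(tiles, axis=1, keepdims=True)
query = np.array([1.0, 0.1, 0.0])
query /= np.linalg.norm(query)

print(top_k_tiles(query, tiles))  # prints [0 2 1]
```

The top-3 indices feed straight into the VQA reader described in the benchmark section.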
Eval Benchmarks
- v6: 200 queries, 5291 tiles (hard-mini-v6)
- v8: 400 queries, 7426 tiles (hard-mini-v8, preferred benchmark)
- QA score pipeline: retrieval top-3 β VQA reader (Qwen3-VL-4B-Instruct or Qwen3.5-4B) β GPT-4.1 grader (correct/incorrect)
- For the Qwen3.5 reader: `enable_thinking=True` + `max_tokens=8192` is recommended; `enable_thinking=False` + `max_tokens=200` is the fast/cheap baseline.
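The QA score reported in the tables above is simply the fraction of queries the grader marks correct. A trivial sketch of that aggregation (the 'correct'/'incorrect' label strings are an assumption about the grader's output format):

```python
def qa_score(grades: list[str]) -> float:
    """Fraction of grader verdicts that are 'correct'.

    Assumes one verdict per query, as in the 400-query mini-v8 eval.
    """
    return sum(g == "correct" for g in grades) / len(grades)

# e.g. 315 correct out of 400 queries matches ckpt150's no-think QA score
print(qa_score(["correct"] * 315 + ["incorrect"] * 85))  # prints 0.7875
```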