Wiki Screenshot Embedding LoRA Checkpoints

LoRA / DoRA adapter checkpoints for Qwen3-VL-Embedding-2B, fine-tuned on Wikipedia screenshot tiles for visual document retrieval.

All evals on mini-v8 (400 queries, 7426 tiles).


Headline (Cross-reader / Cross-thinking comparison @ best ckpts)

| ckpt | R@1 | R@3 | Qwen3-VL-4B (no-think, mt=200) | Qwen3.5-4B (no-think, mt=200) | Qwen3.5-4B (think, mt=8192) |
|---|---|---|---|---|---|
| Base (no LoRA) | 0.6875 | 0.8300 | — | 0.7300 | 0.7925 |
| hyper3/ckpt250 (LLM-only LoRA) | 0.7000 | 0.8400 | 0.7775 | 0.7700 | 0.8225 |
| lora_vit/ckpt200 | 0.7575 | 0.8900 | 0.7875 | 0.7850 | 0.8375 |
| lora_vit/ckpt250 | 0.7625 | 0.8925 | 0.7800 | — | 0.8375 |
| lora_vit/ckpt300 | 0.7625 | 0.8925 | 0.7825 | — | — |
| dora_ls005/ckpt150 ★ | 0.7575 | 0.8825 | 0.7875 | 0.7875 | 0.8525 |
| dora_ls005/ckpt225 | 0.7675 | 0.8825 | 0.7800 | 0.7725 | 0.8450 |
| dora_ls005/ckpt250 | 0.7575 | 0.8850 | 0.7750 | 0.7750 | 0.8325 |

Best per metric:

  • R@1: dora_ls005/ckpt225 (0.7675)
  • R@3: lora_vit/ckpt250 / ckpt300 (0.8925)
  • Qwen3-VL no-think QA: dora_ls005/ckpt150 / lora_vit/ckpt200 tied (0.7875)
  • Qwen3.5 no-think QA: dora_ls005/ckpt150 (0.7875)
  • Qwen3.5 think QA (mt=8192): dora_ls005/ckpt150 (0.8525) β˜… overall SOTA

dora_ls005/ckpt150 is the most well-rounded checkpoint: it wins all three QA metrics and trails the best retrieval numbers only marginally (R@1 -0.01, R@3 -0.01).
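For reference, R@k here is standard recall@k: a query counts as a hit when its gold tile appears among the top-k retrieved tiles. A minimal numpy sketch (function and variable names are ours, not from the eval code):

```python
import numpy as np

def recall_at_k(sims: np.ndarray, gold: np.ndarray, k: int) -> float:
    """sims: (num_queries, num_tiles) similarity matrix.
    gold: (num_queries,) index of the correct tile per query."""
    # indices of the k highest-scoring tiles per query
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = (topk == gold[:, None]).any(axis=1)
    return float(hits.mean())

# toy example: 3 queries over 4 tiles
sims = np.array([[0.9, 0.1, 0.2, 0.3],
                 [0.2, 0.8, 0.7, 0.1],
                 [0.1, 0.9, 0.3, 0.2]])
gold = np.array([0, 2, 0])
r1 = recall_at_k(sims, gold, 1)  # only query 0 hits at k=1
r3 = recall_at_k(sims, gold, 3)  # queries 0 and 1 hit at k=3
```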


Training Configs

dora_ls005 β˜… NEW β€” best QA across all readers

v8r-style + DoRA + label smoothing 0.05 + tokens=4096

  • Base model: Qwen/Qwen3-VL-Embedding-2B
  • Data: natural_filtered_v2 (104k pairs, 2 hard negatives)
  • LoRA: rank 32, alpha 32, DoRA enabled (use_dora: true), targets LLM (q/k/v/o_proj) + ViT (attn.qkv, attn.proj, mlp.linear_fc1/fc2)
  • Loss: InfoNCE + label smoothing = 0.05
  • LR: 7e-6, cosine schedule, 20 warmup steps for the image phase (after the 50-step text warmup below)
  • Batch size: 64 (single GPU; effective batch 64 β€” smaller than lora_vit's 256)
  • Max steps: 350, save every 25
  • Visual tokens: 4096
  • Text warmup: 50 steps text-only before image training
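The loss above is the usual InfoNCE over in-batch and hard negatives; label smoothing 0.05 moves 5% of the target mass onto the negatives. A numpy sketch of the smoothed objective (our own illustration, not the training code; the temperature value is an assumption):

```python
import numpy as np

def info_nce_label_smoothing(q, d, temperature=0.05, smoothing=0.05):
    """q, d: (batch, dim) L2-normalized query / document embeddings.
    The positive for query i is document i; all other docs are negatives."""
    logits = q @ d.T / temperature                       # (batch, batch)
    n = logits.shape[0]
    # smoothed targets: 1 - eps on the diagonal, eps spread over negatives
    targets = np.full((n, n), smoothing / (n - 1))
    np.fill_diagonal(targets, 1.0 - smoothing)
    # numerically stable log-softmax over each row
    m = logits.max(axis=1, keepdims=True)
    log_probs = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    # cross-entropy against the smoothed target distribution
    return float(-(targets * log_probs).sum(axis=1).mean())
```

With smoothing=0.0 this reduces to plain InfoNCE; with smoothing>0 the minimum achievable loss is strictly positive, which discourages over-confident similarity scores.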

Eval results (mini-v8, 400 queries / 7426 tiles)

| Step | R@1 | R@3 | Qwen3-VL-4B (no-think) | Qwen3.5-4B (no-think) | Qwen3.5-4B (think, mt=8192) |
|---|---|---|---|---|---|
| 125 | 0.7425 | 0.8750 | — | 0.7800 | 0.8400 |
| 150 ★ | 0.7575 | 0.8825 | 0.7875 | 0.7875 | 0.8525 |
| 175 | 0.7550 | 0.8750 | — | 0.7800 | 0.8375 |
| 200 | 0.7525 | 0.8750 | — | 0.7775 | 0.8325 |
| 225 | 0.7675 | 0.8825 | 0.7800 | 0.7725 | 0.8450 |
| 250 | 0.7575 | 0.8850 | 0.7750 | 0.7750 | 0.8325 |

Why thinking helps so much

The Qwen3.5-4B reader is a reasoning model. With enable_thinking=True and large max_tokens (β‰₯8192) the answer quality jumps substantially:

| Reader config | Base (no LoRA) | Best ckpt (dora_ls005/ckpt150) |
|---|---|---|
| Qwen3.5 no-think, mt=200 | 0.7300 | 0.7875 (+5.75 pts) |
| Qwen3.5 think, mt=8192 | 0.7925 (+6.25 pts) | 0.8525 (+12.25 pts) |

max_tokens sensitivity for dora_ls005/ckpt150 + thinking:

  • mt=2048 β†’ 0.8175
  • mt=4096 β†’ 0.8325
  • mt=8192 β†’ 0.8525 ← sweet spot
  • mt=16384 β†’ 0.8500 (plateau)

β†’ must use max_tokens β‰₯ 8192 for thinking, otherwise reasoning gets truncated.


lora_vit (v8_r_warmup50_lr7e6_lora_vit_350) — previous best

  • Base model: Qwen/Qwen3-VL-Embedding-2B
  • Data: natural_filtered_v2 (104k pairs, 2 hard negatives)
  • LoRA: rank 32, alpha 32, target includes ViT layers (attn.qkv, attn.proj, mlp.linear_fc1/fc2)
  • LR: 7e-6, cosine schedule, warmup 50 steps
  • Batch size: 256 effective
  • Max steps: 350
  • Visual tokens: 4096

Eval Results

| Step | v6 R@1 | v6 R@3 | v6 QA | v6 QA Δ vs base | v8 R@1 | v8 R@3 | v8 QA | v8 QA Δ vs base |
|---|---|---|---|---|---|---|---|---|
| base | 0.655 | 0.800 | 0.645 | — | 0.690 | 0.828 | 0.708 | — |
| 100 | 0.690 | 0.855 | 0.715 | +0.070 | 0.725 | 0.870 | 0.760 | +0.052 |
| 150 | 0.720 | 0.870 | 0.735 | +0.090 | 0.750 | 0.888 | 0.780 | +0.072 |
| 200 | 0.740 | 0.860 | 0.750 | +0.105 | 0.753 | 0.890 | 0.793 | +0.085 |

Best: ckpt200 (v8 QA 0.793, +8.5 pts over base; v6 QA 0.750, +10.5 pts)

hyper3 (v8_i_warmup50_lr7e6_hardswitch_350)

  • Base model: Qwen/Qwen3-VL-Embedding-2B
  • Data: natural_filtered_v2 (104k pairs, 2 hard negatives)
  • LoRA: rank 32, alpha 32 (LLM layers only, no ViT)
  • LR: 7e-6, cosine schedule, warmup 50 steps
  • Batch size: 256 effective
  • Max steps: 350 (hard switch)
  • Visual tokens: 4096

Eval Results

| Step | v6 QA | v6 QA Δ vs base | v8 QA | v8 QA Δ vs base |
|---|---|---|---|---|
| base | 0.645 | — | 0.708 | — |
| 100 | 0.715 | +0.070 | 0.745 | +0.038 |
| 150 | 0.710 | +0.065 | 0.748 | +0.040 |
| 200 | 0.725 | +0.080 | 0.753 | +0.045 |
| 250 | 0.715 | +0.070 | 0.770 | +0.063 |
| 300 | 0.715 | +0.070 | 0.763 | +0.055 |
| 350 | 0.715 | +0.070 | 0.763 | +0.055 |

Best: ckpt250 (v8 QA 0.770, +6.3 pts over base)

Usage

```python
from peft import PeftModel
from transformers import AutoModel

base = AutoModel.from_pretrained("Qwen/Qwen3-VL-Embedding-2B")

# Best overall (DoRA + LS=0.05): wins QA on every reader
model = PeftModel.from_pretrained(
    base,
    "Chrisyichuan/wiki-screenshot-embedding-lora",
    subfolder="dora_ls005/ckpt150",
)
```

DoRA is auto-detected from adapter_config.json (use_dora: true) β€” no extra code needed; PEFT loads lora_A, lora_B, and lora_magnitude_vector automatically.

Eval Benchmarks

  • v6: 200 queries, 5291 tiles (hard-mini-v6)
  • v8: 400 queries, 7426 tiles (hard-mini-v8, preferred benchmark)
  • QA score pipeline: retrieval top-3 β†’ VQA reader (Qwen3-VL-4B-Instruct or Qwen3.5-4B) β†’ GPT-4.1 grader (correct/incorrect)
  • For Qwen3.5 reader: enable_thinking=True + max_tokens=8192 recommended; enable_thinking=False + max_tokens=200 is the fast/cheap baseline.
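The scoring pipeline above can be outlined with the reader and grader left as pluggable callables (a structural sketch only; every name here is ours, not from the eval code):

```python
import numpy as np

def qa_score(query_embs, tile_embs, questions, tiles, reader, grader, k=3):
    """query_embs: (n, d) and tile_embs: (m, d) L2-normalized arrays.
    reader(question, tiles) -> answer string (e.g. a VQA model call);
    grader(question, answer) -> bool (e.g. a GPT-4.1 correctness check)."""
    sims = query_embs @ tile_embs.T
    topk = np.argsort(-sims, axis=1)[:, :k]   # top-k tile indices per query
    correct = 0
    for i, question in enumerate(questions):
        answer = reader(question, [tiles[j] for j in topk[i]])
        correct += bool(grader(question, answer))
    return correct / len(questions)
```

In the real pipeline the reader consumes screenshot tiles rather than strings, but the control flow (retrieve top-3, answer, grade) is the same.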