Wiki Screenshot Embedding LoRA Checkpoints
LoRA / DoRA adapter checkpoints for Qwen3-VL-Embedding-2B, fine-tuned on Wikipedia screenshot tiles for visual document retrieval.
All evals on mini-v8 (400 queries, 7426 tiles).
Headline (Cross-reader / Cross-thinking comparison @ best ckpts)
| ckpt | R@1 | R@3 | Qwen3-VL-4B no-think mt=200 | Qwen3.5-4B no-think mt=200 | Qwen3.5-4B think mt=8192 |
|---|---|---|---|---|---|
| Base (no LoRA) | 0.6875 | 0.830 | – | 0.7300 | 0.7925 |
| hyper3/ckpt250 (LLM-only LoRA) | 0.7000 | 0.840 | 0.7775 | 0.7700 | 0.8225 |
| lora_vit/ckpt200 | 0.7575 | 0.8900 | 0.7875 | 0.7850 | 0.8375 |
| lora_vit/ckpt250 | 0.7625 | 0.8925 | 0.7800 | – | 0.8375 |
| lora_vit/ckpt300 | 0.7625 | 0.8925 | 0.7825 | – | – |
| dora_ls005/ckpt150 ← | 0.7575 | 0.8825 | 0.7875 | 0.7875 | 0.8525 |
| dora_ls005/ckpt225 | 0.7675 | 0.8825 | 0.7800 | 0.7725 | 0.8450 |
| dora_ls005/ckpt250 | 0.7575 | 0.8850 | 0.7750 | 0.7750 | 0.8325 |
Best per metric:
- R@1: dora_ls005/ckpt225 (0.7675)
- R@3: lora_vit/ckpt250 and ckpt300 (0.8925)
- Qwen3-VL no-think QA: dora_ls005/ckpt150 and lora_vit/ckpt200, tied (0.7875)
- Qwen3.5 no-think QA: dora_ls005/ckpt150 (0.7875)
- Qwen3.5 think QA (mt=8192): dora_ls005/ckpt150 (0.8525) ← overall SOTA
dora_ls005/ckpt150 is the most well-rounded checkpoint: it wins all three QA metrics and trails the retrieval leaders only marginally (R@1 -0.01, R@3 -0.01).
Training Configs
dora_ls005 (NEW): best QA across all readers
v8r-style + DoRA + label smoothing 0.05 + tokens=4096
- Base model: Qwen/Qwen3-VL-Embedding-2B
- Data: natural_filtered_v2 (104k pairs, 2 hard negatives)
- LoRA: rank 32, alpha 32, DoRA enabled (use_dora: true); targets LLM (q/k/v/o_proj) + ViT (attn.qkv, attn.proj, mlp.linear_fc1/fc2)
- Loss: InfoNCE with label smoothing 0.05
- LR: 7e-6, cosine schedule, warmup 20 steps (image) + 50 steps (text)
- Batch size: 64 (single GPU; effective batch 64, smaller than lora_vit's 256)
- Max steps: 350, save every 25
- Visual tokens: 4096
- Text warmup: 50 steps of text-only training before image training
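The bullets above map onto a PEFT-style adapter config. A minimal sketch of that mapping as plain dicts, assuming standard `peft.LoraConfig` field names (the actual training harness is not included in this repo, so treat the exact keys as illustrative):

```python
# Hypothetical reconstruction of the dora_ls005 adapter settings as a
# peft.LoraConfig-style dict. Field names follow PEFT conventions; the
# real training config may differ in details.
dora_ls005_adapter = {
    "r": 32,                 # LoRA rank
    "lora_alpha": 32,        # scaling alpha (alpha / r = 1.0)
    "use_dora": True,        # DoRA: adds learnable magnitude vectors
    "target_modules": [
        # LLM attention projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        # ViT attention + MLP
        "attn.qkv", "attn.proj", "mlp.linear_fc1", "mlp.linear_fc2",
    ],
}

# Matching trainer-side hyperparameters from the bullet list above.
dora_ls005_train = {
    "learning_rate": 7e-6,
    "lr_schedule": "cosine",
    "batch_size": 64,
    "max_steps": 350,
    "save_every": 25,
    "label_smoothing": 0.05,
    "visual_tokens": 4096,
    "text_warmup_steps": 50,
    "image_warmup_steps": 20,
}
```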
Eval results (mini-v8, 400 queries / 7426 tiles)
| Step | R@1 | R@3 | Qwen3-VL-4B (no-think) | Qwen3.5-4B (no-think) | Qwen3.5-4B (think mt=8192) |
|---|---|---|---|---|---|
| 125 | 0.7425 | 0.8750 | – | 0.7800 | 0.8400 |
| 150 ← | 0.7575 | 0.8825 | 0.7875 | 0.7875 | 0.8525 |
| 175 | 0.7550 | 0.8750 | – | 0.7800 | 0.8375 |
| 200 | 0.7525 | 0.8750 | – | 0.7775 | 0.8325 |
| 225 | 0.7675 | 0.8825 | 0.7800 | 0.7725 | 0.8450 |
| 250 | 0.7575 | 0.8850 | 0.7750 | 0.7750 | 0.8325 |
Why thinking helps so much
The Qwen3.5-4B reader is a reasoning model. With enable_thinking=True and large max_tokens (β₯8192) the answer quality jumps substantially:
| Reader config | Base (no LoRA) | Best ckpt (dora_ls005/ckpt150) |
|---|---|---|
| Qwen3.5 no-think mt=200 | 0.7300 | 0.7875 (+5.75%) |
| Qwen3.5 think mt=8192 | 0.7925 (+6.25%) | 0.8525 (+12.25%) |
max_tokens sensitivity for dora_ls005/ckpt150 with thinking:
- mt=2048 → 0.8175
- mt=4096 → 0.8325
- mt=8192 → 0.8525 ← sweet spot
- mt=16384 → 0.8500 (plateau)

Use max_tokens ≥ 8192 when thinking is enabled; otherwise the reasoning trace gets truncated before the final answer.
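The two reader configurations can be expressed as generation-request payloads. This sketch assumes an OpenAI-compatible serving API (e.g. vLLM) where Qwen's thinking mode is toggled via `chat_template_kwargs`; the actual eval harness may invoke the reader differently:

```python
def reader_request(question: str, think: bool) -> dict:
    """Build a chat-completion payload for the Qwen3.5-4B reader.

    Hypothetical sketch: assumes an OpenAI-compatible server where
    Qwen's reasoning mode is toggled via chat_template_kwargs.
    """
    return {
        "model": "Qwen3.5-4B",
        "messages": [{"role": "user", "content": question}],
        # Thinking mode needs a large budget so the reasoning trace
        # is not truncated before the final answer is emitted.
        "max_tokens": 8192 if think else 200,
        "chat_template_kwargs": {"enable_thinking": think},
    }

fast = reader_request("Who founded the company?", think=False)  # cheap baseline
best = reader_request("Who founded the company?", think=True)   # best accuracy
```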
lora_vit (v8_r_warmup50_lr7e6_lora_vit_350): original best
- Base model: Qwen/Qwen3-VL-Embedding-2B
- Data: natural_filtered_v2 (104k pairs, 2 hard negatives)
- LoRA: rank 32, alpha 32; targets include ViT layers (attn.qkv, attn.proj, mlp.linear_fc1/fc2)
- LR: 7e-6, cosine schedule, warmup 50 steps
- Batch size: 256 effective
- Max steps: 350
- Visual tokens: 4096
Eval Results
| Step | v6 recall@1 | v6 recall@3 | v6 QA | v6 vs base | v8 recall@1 | v8 recall@3 | v8 QA | v8 vs base |
|---|---|---|---|---|---|---|---|---|
| base | 0.655 | 0.800 | 0.645 | – | 0.690 | 0.828 | 0.708 | – |
| 100 | 0.690 | 0.855 | 0.715 | +0.070 | 0.725 | 0.870 | 0.760 | +0.052 |
| 150 | 0.720 | 0.870 | 0.735 | +0.090 | 0.750 | 0.888 | 0.780 | +0.072 |
| 200 | 0.740 | 0.860 | 0.750 | +0.105 | 0.753 | 0.890 | 0.793 | +0.085 |
Best: ckpt200 β v8 QA = 0.793 (+8.5%), v6 QA = 0.750 (+10.5%)
hyper3 (v8_i_warmup50_lr7e6_hardswitch_350)
- Base model: Qwen/Qwen3-VL-Embedding-2B
- Data: natural_filtered_v2 (104k pairs, 2 hard negatives)
- LoRA: rank 32, alpha 32 (LLM layers only, no ViT)
- LR: 7e-6, cosine schedule, warmup 50 steps
- Batch size: 256 effective
- Max steps: 350 (hard switch)
- Visual tokens: 4096
Eval Results
| Step | v6 QA | v6 vs base | v8 QA | v8 vs base |
|---|---|---|---|---|
| base | 0.645 | – | 0.708 | – |
| 100 | 0.715 | +0.070 | 0.745 | +0.038 |
| 150 | 0.710 | +0.065 | 0.748 | +0.040 |
| 200 | 0.725 | +0.080 | 0.753 | +0.045 |
| 250 | 0.715 | +0.070 | 0.770 | +0.063 |
| 300 | 0.715 | +0.070 | 0.763 | +0.055 |
| 350 | 0.715 | +0.070 | 0.763 | +0.055 |
Best: ckpt250 (v8 QA = 0.770, +6.3% over base)
Usage
```python
from peft import PeftModel
from transformers import AutoModel

base = AutoModel.from_pretrained("Qwen/Qwen3-VL-Embedding-2B")

# Best overall (DoRA + LS=0.05): wins QA on every reader
model = PeftModel.from_pretrained(
    base,
    "Chrisyichuan/wiki-screenshot-embedding-lora",
    subfolder="dora_ls005/ckpt150",
)
```
DoRA is auto-detected from adapter_config.json (use_dora: true), so no extra code is needed: PEFT loads lora_A, lora_B, and lora_magnitude_vector automatically.
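Once the adapter is loaded, retrieval is a standard embed-and-rank loop over the tile index. A minimal sketch of the scoring step with NumPy (the toy embeddings below are placeholders for real model outputs; the embedding model exposes its own encode interface):

```python
import numpy as np

def top_k_tiles(query_emb: np.ndarray, tile_embs: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k tiles most similar to the query.

    Assumes embeddings are L2-normalized, so the dot product
    equals cosine similarity.
    """
    scores = tile_embs @ query_emb       # shape: (num_tiles,)
    return np.argsort(-scores)[:k]       # highest score first

# Toy example: 4 fake 3-d tile embeddings standing in for real outputs.
tiles = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.7, 0.7, 0.0],
                  [0.0, 0.0, 1.0]])
tiles /= np.linalg.norm(tiles, axis=1, keepdims=True)
query = np.array([1.0, 0.1, 0.0])
query /= np.linalg.norm(query)

print(top_k_tiles(query, tiles))  # prints [0 2 1]
```

The top-3 indices feed straight into the VQA reader described in the benchmark section.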
Eval Benchmarks
- v6: 200 queries, 5291 tiles (hard-mini-v6)
- v8: 400 queries, 7426 tiles (hard-mini-v8, preferred benchmark)
- QA score pipeline: retrieval top-3 β VQA reader (Qwen3-VL-4B-Instruct or Qwen3.5-4B) β GPT-4.1 grader (correct/incorrect)
- For the Qwen3.5 reader: `enable_thinking=True` + `max_tokens=8192` is recommended; `enable_thinking=False` + `max_tokens=200` is the fast/cheap baseline.
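The QA score reported in the tables above is simply the fraction of queries the grader marks correct. A trivial sketch of that aggregation (the 'correct'/'incorrect' label strings are an assumption about the grader's output format):

```python
def qa_score(grades: list[str]) -> float:
    """Fraction of grader verdicts that are 'correct'.

    Assumes one verdict per query, as in the 400-query mini-v8 eval.
    """
    return sum(g == "correct" for g in grades) / len(grades)

# e.g. 315 correct out of 400 queries matches ckpt150's no-think QA score
print(qa_score(["correct"] * 315 + ["incorrect"] * 85))  # prints 0.7875
```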