📄 Paper | 📝 OriOn Blog | 🔧 Pipeline Code | 📊 Benchmark (MMLBD-C) | 🪐 OriOn Collection
OriOn-Qwen Synthetic Reasoning 1
SOTA on MMLongBenchDoc (58.3), surpassing a 7x larger model. This checkpoint extends OriOn-Qwen with synthetic reasoning traces that are internalized via low-strength model merging, achieving frontier long-document QA performance with no increase in inference cost.
TL;DR
We introduce a synthetic reasoning pipeline for long-document VQA: score every page for question relevance, extract evidence, keep the top-K pages sorted by relevance, and use this as a structured <think> trace during SFT. Low-strength model merging (α=0.25) then internalizes the reasoning: the model does not generate explicit thinking tokens, yet retains the full performance benefit. A <cot> control token gates the capability at inference time. The result is a 32B model that beats Qwen3-VL-235B-A22B-Instruct on MMLongBenchDoc while producing only ~250 mean output tokens.
Highlights
- SOTA on MMLongBenchDoc with 58.3 accuracy, surpassing Qwen3-VL-235B-A22B-Instruct (57.0) and -Thinking (56.2) with 7x fewer parameters
- Internalized reasoning via low-strength model merging: no `<think>` tokens emitted, yet full performance retained
- Controllable: place `<cot>` in the system prompt to activate reasoning (+3.8 MMLBD when on vs. off)
- Drop-in replacement for `Qwen/Qwen3-VL-32B-Instruct`: same `Qwen3VLForConditionalGeneration` + `AutoProcessor` API
How It Works
This checkpoint builds on OriOn and extends it with synthetic reasoning traces (paper).
Synthetic reasoning pipeline
Given a document of N pages and a question Q:
- Evidence extraction & scoring: an extractor VLM (`Qwen3-VL-32B-Instruct`) processes each page independently, producing a relevance score in [0, 10] and a natural-language evidence snippet.
- Top-K selection: pages below a relevance threshold are dropped; the top-K (default 24) are kept, sorted by descending relevance.
- Answer generation through two parallel branches: a visual branch (teacher VLM receives top-ranked page images) and a text branch (teacher LLM receives only the extracted evidence). Training examples are drawn equally from both.
The relevance-sorted evidence is placed inside <think> tags, gated by a <cot> control token (present in 95% of training examples).
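The filtering, sorting, and trace assembly steps above can be sketched as follows. This is a minimal illustration, not the released pipeline code: the `threshold` value, the `PageEvidence` structure, and the exact trace wording are assumptions.

```python
from dataclasses import dataclass

@dataclass
class PageEvidence:
    page: int     # 1-based page index
    score: float  # relevance in [0, 10], produced by the extractor VLM
    snippet: str  # natural-language evidence extracted from the page

def build_think_trace(pages, threshold=3.0, top_k=24, use_cot=True):
    """Assemble relevance-sorted evidence into a <think> trace.

    Pages below `threshold` are dropped, the top-k survivors are kept and
    sorted by descending relevance, and the trace is gated by the <cot>
    control token (present in 95% of training examples).
    """
    kept = sorted(
        (p for p in pages if p.score >= threshold),
        key=lambda p: p.score,
        reverse=True,
    )[:top_k]
    body = "\n".join(
        f"[page {p.page} | relevance {p.score:.1f}] {p.snippet}" for p in kept
    )
    trace = f"<think>\n{body}\n</think>"
    return f"<cot>\n{trace}" if use_cot else trace

pages = [
    PageEvidence(1, 2.0, "Cover page; no relevant content."),
    PageEvidence(12, 9.5, "Table of revenue growth per subsidiary."),
    PageEvidence(30, 7.0, "Narrative discussion of revenue trends."),
]
print(build_think_trace(pages, top_k=2))
```

Note that the low-scoring page is simply absent from the trace rather than marked as irrelevant, which matters for the v2 trace design discussed below.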
Internalization via model merging
The final checkpoint is produced by task arithmetic: θ_merged = θ_base + α · (θ_SFT − θ_base). At α=0.25, the model does not emit thinking tokens and its mean output length is comparable to a non-reasoning baseline, yet it retains the full performance gains. Increasing α to 0.5 shifts the model to explicit reasoning with 12.4x more output tokens.
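The merge formula is a per-parameter interpolation and can be sketched in a few lines. This toy version operates on plain floats; the real merge applies the same update to every tensor in the two checkpoints' state dicts.

```python
def task_arithmetic_merge(theta_base, theta_sft, alpha=0.25):
    """Task arithmetic: theta_merged = theta_base + alpha * (theta_sft - theta_base).

    alpha=0 recovers the base model, alpha=1 recovers the SFT model;
    alpha=0.25 is the low-strength setting used for this checkpoint.
    """
    return {k: b + alpha * (theta_sft[k] - b) for k, b in theta_base.items()}

base = {"layer.weight": 1.0, "layer.bias": 2.0}
sft = {"layer.weight": 3.0, "layer.bias": 6.0}

merged = task_arithmetic_merge(base, sft, alpha=0.25)
print(merged)  # {'layer.weight': 1.5, 'layer.bias': 3.0}
```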
Why trace design matters
An earlier v1 pipeline visited every page sequentially and explicitly marked irrelevant ones, which taught the model a pathological page-by-page looping behavior. The v2 redesign (bounded top-K selection, relevance-ordered evidence, no irrelevant-page markers) eliminates this failure mode and yields substantial gains across all primary metrics.
Related
| Resource | Description |
|---|---|
| OriOn-Qwen | Base OriOn checkpoint (LongPO, no reasoning) |
| OriOn-Mistral | Mistral variant with +16.8% MMLBD improvement |
| MMLBD-C | Manually corrected MMLongBenchDoc benchmark |
| Pipeline Code | Synthetic reasoning pipeline (Apache 2.0 fork of distilabel) |
Benchmarks
Official MMLongBenchDoc leaderboard
| Model | Acc | Params |
|---|---|---|
| OriOn-Qwen-SR1 (this model) | 58.3 | 32B |
| Qwen3-VL-235B-A22B-Instruct | 57.0 | 235B (22B active) |
| Qwen3-VL-235B-A22B-Thinking | 56.2 | 235B (22B active) |
| TeleMM-2.0 | 56.1 | – |
| Qwen3-VL-32B-Instruct | 55.4 | 32B |
| GLM-4.6V | 54.9 | 106B (12B active) |
| GPT-4o | 46.3 | – |
Full benchmark suite (Qwen3-VL family)
Deltas are relative to the Qwen3-VL-32B-Instruct base model.
| Model | VA | LCA | MMLBD | MMLBD-C | MMLB 128K | SlideVQA | HELMET | DUDE |
|---|---|---|---|---|---|---|---|---|
| 235B-A22B-Instruct | 98.4 | 98.5 | 54.8 | 56.2 | 78.6 | 84.5 | 67.6 | 59.1 |
| OriOn-Qwen-SR1 (this model) | 95.0 (+1.3) | 94.4 (+2.3) | 55.8 (+4.0) | 58.2 (+4.4) | 75.7 (+5.3) | 75.4 (-1.8) | 68.5 (+5.5) | 55.1 (-6.7) |
| LongPO (OriOn-Qwen) | 94.0 (+0.3) | 92.4 (+0.3) | 53.6 (+1.8) | 56.4 (+2.6) | 75.6 (+5.2) | 75.5 (-1.7) | 62.9 (-0.1) | 56.0 (-5.8) |
| 32B-Instruct (base) | 93.7 | 92.1 | 51.8 | 53.8 | 70.4 | 77.2 | 63.0 | 61.8 |
VA = Visual-LC Average over MMLBD, MMLBD-C, MMLongBench, DUDE, and SlideVQA. LCA = Long-Context Average, extending VA with HELMET and LongBench v2. See the paper for full results, including the Mistral variant, control-token ablations, and trace-design comparisons.
Reasoning Behavior
Place <cot> at the beginning of the system prompt to activate internalized reasoning. This improves performance with only a slight increase in output tokens.
```
System: <cot>
User: What is the average revenue growth across all subsidiaries mentioned in pages 12-45?
```
Without <cot>, the model still works but performance degrades (e.g. -3.8 MMLBD for Qwen). The model does not emit <think> tokens at α=0.25; the reasoning is internalized.
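Toggling the control token amounts to optionally prepending a one-token system message. A small helper (hypothetical, not part of the model's API; the message schema matches the Transformers usage below) makes the gating explicit:

```python
def build_messages(question, page_urls, use_cot=True):
    """Build a chat request, optionally prepending the <cot> control token
    as the system prompt to activate internalized reasoning."""
    messages = []
    if use_cot:
        messages.append({"role": "system", "content": "<cot>"})
    content = [{"type": "image", "url": url} for url in page_urls]
    content.append({"type": "text", "text": question})
    messages.append({"role": "user", "content": content})
    return messages

msgs = build_messages("What is the total revenue?", ["p1.png", "p2.png"])
print(msgs[0])  # {'role': 'system', 'content': '<cot>'}
```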
Intended Use
This checkpoint is designed for:
- Long PDF and slide-deck question answering (up to 250+ pages in a single pass)
- Multi-page document reasoning requiring cross-page synthesis
- Long-context visual document understanding in enterprise, legal, scientific and financial domains
This is a research checkpoint that retains most of Qwen/Qwen3-VL-32B-Instruct's general capabilities while significantly improving long-document performance.
Usage with Transformers
This model uses the same API as Qwen/Qwen3-VL-32B-Instruct:
```python
import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

model_id = "lightonai/OriOn-Qwen-SR1"

model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Multi-page document QA with <cot> reasoning
messages = [
    {"role": "system", "content": "<cot>"},
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "page1.png"},
            {"type": "image", "url": "page2.png"},
            # ... add all document pages
            {"type": "text", "text": "What are the key findings discussed across this document?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=4096)
generated_ids = output_ids[0, inputs["input_ids"].shape[1]:]
print(processor.decode(generated_ids, skip_special_tokens=True))
```
Usage with vLLM
```shell
vllm serve lightonai/OriOn-Qwen-SR1
```
```python
import base64
import io

import requests
import pypdfium2 as pdfium

ENDPOINT = "http://localhost:8000/v1/chat/completions"
MODEL = "lightonai/OriOn-Qwen-SR1"

# Load and render a multi-page PDF
pdf_data = requests.get("https://arxiv.org/pdf/2412.13663").content
pdf = pdfium.PdfDocument(pdf_data)

# Convert pages to base64-encoded PNG images
page_images = []
for i in range(min(len(pdf), 50)):  # cap at 50 pages for this example
    pil_image = pdf[i].render(scale=2.77).to_pil()
    buffer = io.BytesIO()
    pil_image.save(buffer, format="PNG")
    b64 = base64.b64encode(buffer.getvalue()).decode("utf-8")
    page_images.append({
        "type": "image_url",
        "image_url": {"url": f"data:image/png;base64,{b64}"},
    })

payload = {
    "model": MODEL,
    "messages": [
        {"role": "system", "content": "<cot>"},
        {
            "role": "user",
            "content": [
                *page_images,
                {"type": "text", "text": "Summarize the main contributions of this paper."},
            ],
        },
    ],
    "max_tokens": 4096,
    "temperature": 0.2,
}

response = requests.post(ENDPOINT, json=payload)
print(response.json()["choices"][0]["message"]["content"])
```
Model Details
| | |
|---|---|
| Base model | `Qwen/Qwen3-VL-32B-Instruct` |
| Architecture | `Qwen3VLForConditionalGeneration` |
| Context length | 262,144 tokens |
| Tensor type | bfloat16 |
| Processor | `Qwen3VLProcessor` / `AutoProcessor` |
| Image processor | `Qwen2VLImageProcessorFast` |
| Training | SFT on 50K synthetic reasoning examples + external SFT data (Luth, Smoltalk2) |
| Merge strength | α = 0.25 (task arithmetic with CPT + SFT vectors) |
| Compute | ~40K H100 hours (main training), ~100K H100 hours (project total incl. eval and data gen) |
License
Apache License 2.0
Citation
If you use this checkpoint, please cite both papers:
@misc{long_document_internalized_reasoning,
title={Internalized Reasoning for Long-Context Visual Document Understanding},
author={Austin Veselka},
year={2026},
eprint={2604.02371},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2604.02371},
}
@misc{long_document_training,
title={How to Train Your Long-Context Visual Document Model},
author={Austin Veselka},
year={2026},
eprint={2602.15257},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2602.15257},
}