📄 Paper | 📝 OriOn Blog | 🔧 Pipeline Code | 📊 Benchmark (MMLBD-C) | 🪐 OriOn Collection
OriOn-Qwen Synthetic Reasoning 1
SOTA on MMLongBenchDoc (58.3), surpassing a 7x larger model. This checkpoint extends OriOn-Qwen with synthetic reasoning traces that are internalized via low-strength model merging, achieving frontier long-document QA performance with no increase in inference cost.
TL;DR
We introduce a synthetic reasoning pipeline for long-document VQA: score every page for question relevance, extract evidence, keep the top-K pages sorted by relevance, and use this as a structured <think> trace during SFT. Low-strength model merging (α=0.25) then internalizes the reasoning: the model does not generate explicit thinking tokens, yet retains the full performance benefit. A <cot> control token gates the capability at inference time. The result is a 32B model that beats Qwen3-VL-235B-A22B-Instruct on MMLongBenchDoc while producing only ~250 mean output tokens.
Highlights
- SOTA on MMLongBenchDoc with 58.3 accuracy, surpassing Qwen3-VL-235B-A22B-Instruct (57.0) and -Thinking (56.2) with 7x fewer parameters
- Internalized reasoning via low-strength model merging: no `<think>` tokens emitted, yet full performance retained
- Controllable: place `<cot>` in the system prompt to activate reasoning (+3.8 MMLBD when on vs. off)
- Drop-in replacement for `Qwen/Qwen3-VL-32B-Instruct`: same `Qwen3VLForConditionalGeneration` + `AutoProcessor` API
How It Works
This checkpoint builds on OriOn and extends it with synthetic reasoning traces (paper).
Synthetic reasoning pipeline
Given a document of N pages and a question Q:
- Evidence extraction & scoring: an extractor VLM (`Qwen3-VL-32B-Instruct`) processes each page independently, producing a relevance score in [0, 10] and a natural-language evidence snippet.
- Top-K selection: pages below a relevance threshold are dropped; the top-K (default 24) are kept, sorted by descending relevance.
- Answer generation through two parallel branches: a visual branch (teacher VLM receives top-ranked page images) and a text branch (teacher LLM receives only the extracted evidence). Training examples are drawn equally from both.
The relevance-sorted evidence is placed inside <think> tags, gated by a <cot> control token (present in 95% of training examples).
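The filtering, sorting, and trace assembly steps above can be sketched as follows. This is a minimal illustration, not the released pipeline code: the `threshold` value, the `PageEvidence` structure, and the exact trace wording are assumptions.

```python
from dataclasses import dataclass

@dataclass
class PageEvidence:
    page: int     # 1-based page index
    score: float  # relevance in [0, 10], produced by the extractor VLM
    snippet: str  # natural-language evidence extracted from the page

def build_think_trace(pages, threshold=3.0, top_k=24, use_cot=True):
    """Assemble relevance-sorted evidence into a <think> trace.

    Pages below `threshold` are dropped, the top-k survivors are kept and
    sorted by descending relevance, and the trace is gated by the <cot>
    control token (present in 95% of training examples).
    """
    kept = sorted(
        (p for p in pages if p.score >= threshold),
        key=lambda p: p.score,
        reverse=True,
    )[:top_k]
    body = "\n".join(
        f"[page {p.page} | relevance {p.score:.1f}] {p.snippet}" for p in kept
    )
    trace = f"<think>\n{body}\n</think>"
    return f"<cot>\n{trace}" if use_cot else trace

pages = [
    PageEvidence(1, 2.0, "Cover page; no relevant content."),
    PageEvidence(12, 9.5, "Table of revenue growth per subsidiary."),
    PageEvidence(30, 7.0, "Narrative discussion of revenue trends."),
]
print(build_think_trace(pages, top_k=2))
```

Note that the low-scoring page is simply absent from the trace rather than marked as irrelevant, which matters for the v2 trace design discussed below.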
Internalization via model merging
The final checkpoint is produced by task arithmetic: θ_merged = θ_base + α · (θ_SFT − θ_base). At α=0.25, the model does not emit thinking tokens and its mean output length is comparable to a non-reasoning baseline, yet it retains the full performance gains. Increasing α to 0.5 shifts the model to explicit reasoning with 12.4x more output tokens.
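The merge formula is a per-parameter interpolation and can be sketched in a few lines. This toy version operates on plain floats; the real merge applies the same update to every tensor in the two checkpoints' state dicts.

```python
def task_arithmetic_merge(theta_base, theta_sft, alpha=0.25):
    """Task arithmetic: theta_merged = theta_base + alpha * (theta_sft - theta_base).

    alpha=0 recovers the base model, alpha=1 recovers the SFT model;
    alpha=0.25 is the low-strength setting used for this checkpoint.
    """
    return {k: b + alpha * (theta_sft[k] - b) for k, b in theta_base.items()}

base = {"layer.weight": 1.0, "layer.bias": 2.0}
sft = {"layer.weight": 3.0, "layer.bias": 6.0}

merged = task_arithmetic_merge(base, sft, alpha=0.25)
print(merged)  # {'layer.weight': 1.5, 'layer.bias': 3.0}
```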
Why trace design matters
An earlier v1 pipeline visited every page sequentially and explicitly marked irrelevant ones, which taught the model a pathological page-by-page looping behavior. The v2 redesign (bounded top-K selection, relevance-ordered evidence, no irrelevant-page markers) eliminates this failure mode and yields substantial gains across all primary metrics.
Related
| Resource | Description |
|---|---|
| OriOn-Qwen | Base OriOn checkpoint (LongPO, no reasoning) |
| OriOn-Mistral | Mistral variant with +16.8% MMLBD improvement |
| MMLBD-C | Manually corrected MMLongBenchDoc benchmark |
| Pipeline Code | Synthetic reasoning pipeline (Apache 2.0 fork of distilabel) |
Benchmarks
Official MMLongBenchDoc leaderboard
| Model | Acc | Params |
|---|---|---|
| OriOn-Qwen-SR1 (this model) | 58.3 | 32B |
| Qwen3-VL-235B-A22B-Instruct | 57.0 | 235B (22B active) |
| Qwen3-VL-235B-A22B-Thinking | 56.2 | 235B (22B active) |
| TeleMM-2.0 | 56.1 | – |
| Qwen3-VL-32B-Instruct | 55.4 | 32B |
| GLM-4.6V | 54.9 | 106B (12B active) |
| GPT-4o | 46.3 | – |
Full benchmark suite (Qwen3-VL family)
Deltas are relative to the Qwen3-VL-32B-Instruct base model.
| Model | VA | LCA | MMLBD | MMLBD-C | MMLB 128K | SlideVQA | HELMET | DUDE |
|---|---|---|---|---|---|---|---|---|
| 235B-A22B-Instruct | 98.4 | 98.5 | 54.8 | 56.2 | 78.6 | 84.5 | 67.6 | 59.1 |
| OriOn-Qwen-SR1 (this model) | 95.0 (+1.3) | 94.4 (+2.3) | 55.8 (+4.0) | 58.2 (+4.4) | 75.7 (+5.3) | 75.4 (-1.8) | 68.5 (+5.5) | 55.1 (-6.7) |
| LongPO (OriOn-Qwen) | 94.0 (+0.3) | 92.4 (+0.3) | 53.6 (+1.8) | 56.4 (+2.6) | 75.6 (+5.2) | 75.5 (-1.7) | 62.9 (-0.1) | 56.0 (-5.8) |
| 32B-Instruct (base) | 93.7 | 92.1 | 51.8 | 53.8 | 70.4 | 77.2 | 63.0 | 61.8 |
VA = Visual-LC Average over MMLBD, MMLBD-C, MMLongBench, DUDE, and SlideVQA. LCA = Long-Context Average, extending VA with HELMET and LongBench v2. See the paper for full results, including the Mistral variant, control-token ablations, and trace-design comparisons.
Reasoning Behavior
Place <cot> at the beginning of the system prompt to activate internalized reasoning. This improves performance with only a slight increase in output tokens.
```
System: <cot>
User: What is the average revenue growth across all subsidiaries mentioned in pages 12-45?
```
Without <cot>, the model still works but performance degrades (e.g. -3.8 MMLBD for Qwen). The model does not emit <think> tokens at α=0.25; the reasoning is internalized.
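Toggling the control token amounts to optionally prepending a one-token system message. A small helper (hypothetical, not part of the model's API; the message schema matches the Transformers usage below) makes the gating explicit:

```python
def build_messages(question, page_urls, use_cot=True):
    """Build a chat request, optionally prepending the <cot> control token
    as the system prompt to activate internalized reasoning."""
    messages = []
    if use_cot:
        messages.append({"role": "system", "content": "<cot>"})
    content = [{"type": "image", "url": url} for url in page_urls]
    content.append({"type": "text", "text": question})
    messages.append({"role": "user", "content": content})
    return messages

msgs = build_messages("What is the total revenue?", ["p1.png", "p2.png"])
print(msgs[0])  # {'role': 'system', 'content': '<cot>'}
```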
Intended Use
This checkpoint is designed for:
- Long PDF and slide-deck question answering (up to 250+ pages in a single pass)
- Multi-page document reasoning requiring cross-page synthesis
- Long-context visual document understanding in enterprise, legal, scientific and financial domains
This is a research checkpoint that retains most of Qwen/Qwen3-VL-32B-Instruct's general capabilities while significantly improving long-document performance.
Usage with Transformers
This model uses the same API as Qwen/Qwen3-VL-32B-Instruct:
```python
import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

model_id = "lightonai/OriOn-Qwen-SR1"

model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Multi-page document QA with <cot> reasoning
messages = [
    {"role": "system", "content": "<cot>"},
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "page1.png"},
            {"type": "image", "url": "page2.png"},
            # ... add all document pages
            {"type": "text", "text": "What are the key findings discussed across this document?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=4096)
generated_ids = output_ids[0, inputs["input_ids"].shape[1]:]
print(processor.decode(generated_ids, skip_special_tokens=True))
```
Usage with vLLM
```shell
vllm serve lightonai/OriOn-Qwen-SR1
```
```python
import base64
import io

import requests
import pypdfium2 as pdfium

ENDPOINT = "http://localhost:8000/v1/chat/completions"
MODEL = "lightonai/OriOn-Qwen-SR1"

# Load and render a multi-page PDF
pdf_data = requests.get("https://arxiv.org/pdf/2412.13663").content
pdf = pdfium.PdfDocument(pdf_data)

# Convert pages to base64-encoded PNG images
page_images = []
for i in range(min(len(pdf), 50)):  # cap at 50 pages for this example
    pil_image = pdf[i].render(scale=2.77).to_pil()
    buffer = io.BytesIO()
    pil_image.save(buffer, format="PNG")
    b64 = base64.b64encode(buffer.getvalue()).decode("utf-8")
    page_images.append({
        "type": "image_url",
        "image_url": {"url": f"data:image/png;base64,{b64}"},
    })

payload = {
    "model": MODEL,
    "messages": [
        {"role": "system", "content": "<cot>"},
        {
            "role": "user",
            "content": [
                *page_images,
                {"type": "text", "text": "Summarize the main contributions of this paper."},
            ],
        },
    ],
    "max_tokens": 4096,
    "temperature": 0.2,
}

response = requests.post(ENDPOINT, json=payload)
print(response.json()["choices"][0]["message"]["content"])
```
Model Details
| | |
|---|---|
| Base model | `Qwen/Qwen3-VL-32B-Instruct` |
| Architecture | `Qwen3VLForConditionalGeneration` |
| Context length | 262,144 tokens |
| Tensor type | bfloat16 |
| Processor | `Qwen3VLProcessor` / `AutoProcessor` |
| Image processor | `Qwen2VLImageProcessorFast` |
| Training | SFT on 50K synthetic reasoning examples + external SFT data (Luth, Smoltalk2) |
| Merge strength | α = 0.25 (task arithmetic with CPT + SFT vectors) |
| Compute | ~40K H100 hours (main training), ~100K H100 hours (project total incl. eval and data gen) |
License
Apache License 2.0
Citation
If you use this checkpoint, please cite both papers:
@misc{long_document_internalized_reasoning,
title={Internalized Reasoning for Long-Context Visual Document Understanding},
author={Austin Veselka},
year={2026},
eprint={2604.02371},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2604.02371},
}
@misc{long_document_training,
title={How to Train Your Long-Context Visual Document Model},
author={Austin Veselka},
year={2026},
eprint={2602.15257},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2602.15257},
}