# PDF-OCR-RL: Qwen3-VL-2B SFT Only (Intermediate Checkpoint)
Fine-tuned Qwen3-VL-2B-Instruct for PDF-to-markdown conversion using Supervised Fine-Tuning (SFT) only.
This is the intermediate SFT checkpoint before GRPO refinement. The best model, with GRPO refinement applied, is available at `blazeofchi/pdf-ocr-rl-qwen3vl2b-sft-grpo`.
This is a LoRA adapter (r=32, alpha=64, 2.18% trainable parameters). Load it on top of the base model using PEFT.
## Training Details
This checkpoint was produced by Stage 1 of the two-stage training pipeline.
### SFT Training (100 steps)
Teaches the model the image-to-markdown mapping using supervised examples from rendered PDF pages paired with their source markdown.
| Parameter | Value |
|---|---|
| Learning rate | 2e-5 |
| Batch size | 2 |
| Max steps | 100 |
| Framework | Unsloth + TRL SFTTrainer |
| Training loss | 1.295 → 0.78 |
| Gradient norm | ~1.85 |
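The hyperparameters above map onto a TRL `SFTConfig` roughly as follows. This is a minimal sketch, not the repository's actual training script: the output directory and logging interval are assumptions, and the Unsloth model patching and dataset preparation are omitted.

```python
from trl import SFTConfig

# Sketch of the Stage-1 SFT settings from the table above.
# Paths and logging cadence are assumptions, not from the source.
sft_config = SFTConfig(
    output_dir="outputs/sft",        # assumed path
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    max_steps=100,
    bf16=True,
    logging_steps=10,                # assumed
)
```

This config would then be passed to `SFTTrainer` together with the Unsloth-loaded model and the paired image/markdown dataset.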
## Technical Details
| Detail | Value |
|---|---|
| Base model | unsloth/Qwen3-VL-2B-Instruct (2.15B params) |
| LoRA rank (r) | 32 |
| LoRA alpha | 64 |
| LoRA dropout | 0.0 |
| Target modules | All linear layers |
| Trainable parameters | 2.18% of total |
| Precision | bf16 |
| Hardware | NVIDIA A40 48GB (RunPod) |
| Dataset | 500 train samples from blazeofchi/pdf-ocr-rl-dataset |
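The LoRA settings in the table translate to a PEFT `LoraConfig` along these lines (a sketch under the assumption that "all linear layers" corresponds to PEFT's `"all-linear"` shorthand):

```python
from peft import LoraConfig

# LoRA settings from the table above. The target-module value is an
# assumption standing in for "all linear layers".
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.0,
    target_modules="all-linear",  # PEFT shorthand for every linear layer
    task_type="CAUSAL_LM",
)
```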
## Purpose
This checkpoint serves two purposes:

1. **Intermediate checkpoint for GRPO**: the starting point for Stage 2 (GRPO refinement). GRPO alone produces near-zero gradients on vision-language models, so an SFT warm-up is essential.
2. **SFT-only baseline**: compare against the full SFT+GRPO model to measure the contribution of GRPO refinement.
## Usage
```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
from qwen_vl_utils import process_vision_info
from PIL import Image
import torch

# Load the base model in bf16 and attach the LoRA adapter.
base_model = Qwen3VLForConditionalGeneration.from_pretrained(
    "unsloth/Qwen3-VL-2B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(
    base_model, "blazeofchi/pdf-ocr-rl-qwen3vl2b-sft-only"
)
processor = AutoProcessor.from_pretrained(
    "blazeofchi/pdf-ocr-rl-qwen3vl2b-sft-only"
)

# Build the chat prompt around the rendered PDF page.
image = Image.open("page.png")
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Convert this PDF page to well-structured markdown."},
    ]}
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Greedy decode, then strip the prompt tokens from the output.
output = model.generate(**inputs, max_new_tokens=2048, do_sample=False)
result = processor.decode(
    output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
)
print(result)
```
## Related Models
| Model | Description |
|---|---|
| `blazeofchi/pdf-ocr-rl-qwen3vl2b-sft-grpo` | SFT + GRPO (best model) |
| `blazeofchi/pdf-ocr-rl-qwen3vl2b-sft-only` | SFT only (this model) |
## Citation
```bibtex
@misc{pdf-ocr-rl-2026,
  title={PDF-OCR-RL: Fine-tuning Vision-Language Models for PDF-to-Markdown with GRPO},
  author={Paras Sharma},
  year={2026},
  url={https://github.com/Parassharmaa/pdf-ocr-rl}
}
```
## License
Apache 2.0 (same as base model)
## Evaluation Results

Self-reported results on the `pdf-ocr-rl-dataset` test set:

| Metric | Score |
|---|---|
| Heading F1 | 0.852 |
| Word F1 | 0.720 |
| Edit Distance (Levenshtein) | 0.745 |
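The edit-distance metric is reported as a score where higher is better, which suggests a normalized similarity of the form `1 - distance / max_len`. A minimal pure-Python sketch of that convention (an assumption about the exact normalization used; the card does not specify it):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def edit_similarity(pred: str, ref: str) -> float:
    """1 - normalized edit distance; 1.0 means an exact match."""
    if not pred and not ref:
        return 1.0
    return 1.0 - levenshtein(pred, ref) / max(len(pred), len(ref))

print(edit_similarity("# Title", "# Title"))  # exact match -> 1.0
```

Under this convention, a score of 0.745 would mean the generated markdown differs from the reference by roughly a quarter of its characters.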