# LightOnOCR-2-1B-sam-44-mss-alb

Finetuned OCR model for medieval Samaritan Hebrew and Samaritan Aramaic manuscripts.

## Overview
LightOnOCR-2-1B-sam-44-mss-alb is a specialized OCR model finetuned from lightonai/LightOnOCR-2-1B-base.
It is trained on 44 medieval Samaritan Hebrew and Samaritan Aramaic manuscripts, using mild FineVision albumentations to improve robustness to real-world degradation such as ink fading, page warping, and historical noise.
This model is designed for high-accuracy OCR on historical Samaritan scripts and performs well on both clean and degraded manuscript images.
## Base model

- Base: `lightonai/LightOnOCR-2-1B-base`
- Architecture: Vision–Language OCR (encoder → mmproj → decoder)
- Context length: 16,384 tokens
- Training precision: bf16
- Image preprocessing: `longest_edge = 1540`
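The `longest_edge` rule can be sketched as follows. This is a minimal illustration of aspect-preserving resizing, assuming the processor scales images so their longest side equals 1540 px; the exact rounding (and any snapping to patch multiples) in the real preprocessor is not specified here.

```python
# Sketch: scale an image's dimensions so max(width, height) == longest_edge,
# preserving aspect ratio. Rounding behavior is an assumption.
def longest_edge_resize(width: int, height: int, longest_edge: int = 1540) -> tuple[int, int]:
    """Return (width, height) after scaling so the longest side equals longest_edge."""
    scale = longest_edge / max(width, height)
    return round(width * scale), round(height * scale)

# A 3080x2200 scan is halved to 1540x1100; a 770x1000 crop is upscaled.
print(longest_edge_resize(3080, 2200))
print(longest_edge_resize(770, 1000))
```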
## Training dataset

This model was trained on `samaritan-ai/sam_44_mss_albumentations`.

### Dataset characteristics
- 44 medieval manuscripts
- Languages:
- Samaritan Hebrew
- Samaritan Aramaic
- FineVision albumentations:
- mild geometric distortions
- blur, noise, sharpening
- contrast/brightness variation
- realistic manuscript degradation
- Aligned SAM-based segmentation
- 275k+ training samples
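The photometric side of the augmentations above can be illustrated with a small self-contained sketch. This is not the actual FineVision/albumentations pipeline; the transform choices and parameter ranges are assumptions meant to convey what "mild" degradation looks like on a grayscale page image.

```python
import numpy as np

rng = np.random.default_rng(0)

def mild_augment(img: np.ndarray) -> np.ndarray:
    """Apply small contrast/brightness jitter and Gaussian noise to a uint8 image.

    Illustrative only; ranges here are assumptions, not the dataset's settings.
    """
    x = img.astype(np.float32)
    alpha = rng.uniform(0.9, 1.1)           # contrast jitter
    beta = rng.uniform(-10.0, 10.0)         # brightness shift
    noise = rng.normal(0.0, 3.0, x.shape)   # scan/sensor noise
    x = alpha * (x - 128.0) + 128.0 + beta + noise
    return np.clip(x, 0, 255).astype(np.uint8)

page = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
aug = mild_augment(page)
print(aug.shape, aug.dtype)
```

A real pipeline would add the geometric distortions and blur/sharpening listed above on top of this.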
## Training details

- Frameworks:
  - Unsloth 2026.1.4
  - Transformers 5.1.0
  - PEFT + LoRA
- Precision: bf16
- Image preprocessing: `longest_edge = 1540`
- Early stopping: `patience: 3`, `eval_steps: 1000`, `patience_in_steps: 3000`
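How these three settings interact can be made concrete with a tiny sketch: evaluation runs every `eval_steps` optimizer steps, and training halts after `patience` consecutive evaluations without improvement, so the worst-case overrun is `patience * eval_steps = 3000` steps. The helper below is illustrative, not the trainer's actual implementation.

```python
def should_stop(eval_losses: list[float], patience: int = 3) -> bool:
    """True once the best eval loss has not improved for `patience` consecutive evals."""
    best = float("inf")
    since_best = 0
    for loss in eval_losses:
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
        if since_best >= patience:
            return True
    return False

# Three flat evals in a row trigger a stop; with eval_steps=1000 that is
# 3 * 1000 = 3000 optimizer steps, matching patience_in_steps above.
print(should_stop([0.9, 0.8, 0.81, 0.82, 0.83]))
```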
## Evaluation results

### Before vs after finetuning
| Metric | Before training | After training |
|---|---|---|
| CER | 721.54 | 1.64 |
| WER | 524.51 | 6.17 |
| Perfect matches | 0 / 100 | 78 / 100 |
### JSON summary

```json
{
  "before_training": {
    "cer": 721.5432960893855,
    "wer": 524.512987012987,
    "perfect_matches": 0,
    "total_samples": 100
  },
  "after_training": {
    "cer": 1.6410614525139664,
    "wer": 6.1688311688311686,
    "perfect_matches": 78,
    "total_samples": 100
  },
  "early_stopping": {
    "patience": 3,
    "eval_steps": 1000,
    "patience_in_steps": 3000
  }
}
```
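For reference, CER and WER figures like the ones above are conventionally computed as Levenshtein edit distance divided by reference length, scaled to percent (which is why a model that hallucinates long outputs can score far above 100 before finetuning). The scorer used for this card is not specified, so the sketch below is illustrative.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent."""
    return 100.0 * edit_distance(list(ref), list(hyp)) / max(len(ref), 1)

def wer(ref: str, hyp: str) -> float:
    """Word error rate in percent."""
    return 100.0 * edit_distance(ref.split(), hyp.split()) / max(len(ref.split()), 1)

print(cer("abcd", "abed"))  # one substitution over four characters -> 25.0
```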
## GGUF weights

This model uses the standard multimodal GGUF layout: LLM weights plus mmproj weights (two files per quantization).

### LLM weights

- `LightOnOCR-2-1B-sam-44-mss-alb-f16.gguf` (1.19 GB)
- `LightOnOCR-2-1B-sam-44-mss-alb-f32.gguf` (2.39 GB)
- `LightOnOCR-2-1B-sam-44-mss-alb-q8_0.gguf` (0.63 GB)

### mmproj weights

- `mmproj-LightOnOCR-2-1B-sam-44-mss-alb-f16.gguf` (0.82 GB)
- `mmproj-LightOnOCR-2-1B-sam-44-mss-alb-f32.gguf` (1.63 GB)
- `mmproj-LightOnOCR-2-1B-sam-44-mss-alb-q8_0.gguf` (0.45 GB)

> **Note:** Multimodal GGUF models must remain split into LLM and mmproj files; merging them into a single file is not supported by GGUF or llama.cpp.
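With llama.cpp, the two files are passed separately at load time. A hypothetical invocation is sketched below; the exact binary name and flags depend on your llama.cpp build, so treat this as a starting point rather than a guaranteed command line.

```shell
# Assumed llama.cpp multimodal CLI; check your build for the exact binary/flags.
llama-mtmd-cli \
  -m LightOnOCR-2-1B-sam-44-mss-alb-q8_0.gguf \
  --mmproj mmproj-LightOnOCR-2-1B-sam-44-mss-alb-f16.gguf \
  --image page.png \
  -p "Transcribe the manuscript page."
```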
## Ollama Modelfile example

```
FROM ./LightOnOCR-2-1B-sam-44-mss-alb-q8_0.gguf
FROM ./mmproj-LightOnOCR-2-1B-sam-44-mss-alb-f32.gguf
TEMPLATE {{ .Prompt }}
RENDERER qwen3-vl-instruct
PARSER qwen3-vl-instruct
PARAMETER top_p 0.9
PARAMETER temperature 0.2
PARAMETER num_ctx 16384
PARAMETER num_predict 4096
```
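Once the Modelfile is saved alongside the GGUF files, the model can be registered and run with Ollama. The model name `samaritan-ocr` below is arbitrary.

```shell
# Register the model from the Modelfile, then start an interactive session.
ollama create samaritan-ocr -f Modelfile
ollama run samaritan-ocr
```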
## Usage (Python)

```python
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

processor = AutoProcessor.from_pretrained("samaritan-ai/LightOnOCR-2-1B-sam-44-mss-alb")
model = AutoModelForVision2Seq.from_pretrained("samaritan-ai/LightOnOCR-2-1B-sam-44-mss-alb")

img = Image.open("page.png")
inputs = processor(images=img, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(out[0], skip_special_tokens=True))
```
## License

- Base model: same license as `lightonai/LightOnOCR-2-1B-base`
- Dataset: see `samaritan-ai/sam_44_mss_albumentations`
- This finetuned model: same license as the base unless otherwise specified

## Acknowledgements
- LightOnAI for the base model
- Unsloth for efficient finetuning
- FineVision for augmentation tools
- The Samaritan manuscript preservation community