
LightOnOCR-2-1B-sam-44-mss-alb

Finetuned OCR model for Medieval Samaritan Hebrew & Samaritan Aramaic Manuscripts


Overview

LightOnOCR-2-1B-sam-44-mss-alb is a specialized OCR model finetuned from lightonai/LightOnOCR-2-1B-base.
It is trained on 44 medieval Samaritan Hebrew and Samaritan Aramaic manuscripts, using mild FineVision albumentations to improve robustness to real-world degradation such as ink fading, page warping, and historical noise.

This model is designed for high-accuracy OCR on historical Samaritan scripts and performs well on both clean and degraded manuscript images.


Base model

  • Base: lightonai/LightOnOCR-2-1B-base
  • Architecture: Vision–Language OCR (encoder → mmproj → decoder)
  • Context length: 16,384 tokens
  • Training precision: bf16
  • Image preprocessing: longest_edge = 1540
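The longest_edge = 1540 setting means each page image is scaled so its longer side is at most 1540 pixels, preserving aspect ratio. A minimal sketch of that rule follows; the helper name is hypothetical, and whether the actual processor also upscales smaller images depends on its config (this sketch leaves them unchanged):

```python
def longest_edge_resize(width: int, height: int, longest_edge: int = 1540) -> tuple[int, int]:
    """Scale (width, height) so the longer side equals `longest_edge`,
    preserving aspect ratio. Smaller images are left unchanged here
    (an assumption; the real processor config may differ)."""
    longest = max(width, height)
    if longest <= longest_edge:
        return width, height
    scale = longest_edge / longest
    return round(width * scale), round(height * scale)
```

For example, a 3080 × 1540 scan would be downscaled to 1540 × 770 before being fed to the vision encoder.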

Training dataset

This model was trained on:

  • samaritan-ai/sam_44_mss_albumentations

Dataset characteristics

  • 44 medieval manuscripts
  • Languages:
    • Samaritan Hebrew
    • Samaritan Aramaic
  • FineVision albumentations:
    • mild geometric distortions
    • blur, noise, sharpening
    • contrast/brightness variation
    • realistic manuscript degradation
  • Aligned SAM-based segmentation
  • 275k+ training samples
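The "mild" augmentation regime above (geometric distortion, blur/noise, contrast and brightness jitter) can be illustrated with a toy stand-in. This is not the actual FineVision/albumentations pipeline, just a hedged sketch of the kind of photometric perturbation applied, operating on a flat list of grayscale pixel values:

```python
import random

def mild_augment(pixels, seed=None):
    """Apply mild brightness/contrast jitter and additive noise to a
    flat list of grayscale pixels in [0, 255]. A toy stand-in for the
    FineVision/albumentations transforms, not the real pipeline."""
    rng = random.Random(seed)
    brightness = rng.uniform(-20, 20)   # small brightness shift
    contrast = rng.uniform(0.9, 1.1)    # mild contrast scaling
    noise_sigma = rng.uniform(0, 5)     # light Gaussian noise
    out = []
    for p in pixels:
        v = (p - 128) * contrast + 128 + brightness + rng.gauss(0, noise_sigma)
        out.append(max(0, min(255, round(v))))  # clamp back to valid range
    return out
```

Keeping the perturbations mild matters here: the goal is robustness to ink fading and historical noise without destroying the fine strokes of Samaritan script.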

Training details

  • Frameworks:
    • Unsloth 2026.1.4
    • Transformers 5.1.0
    • PEFT + LoRA
  • Precision: bf16
  • Image preprocessing: longest_edge = 1540
  • Early stopping:
    • patience: 3
    • eval_steps: 1000
    • patience_in_steps: 3000
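With evaluation every 1000 steps and patience 3, training halts once the best eval loss has gone 3 consecutive evaluations (3000 steps) without improving. A minimal sketch of that stopping rule, with a hypothetical helper name:

```python
def should_stop(eval_losses, patience=3):
    """Early-stopping check matching patience=3 with evals every 1000
    steps: stop once the best loss has not improved for `patience`
    consecutive evaluations (i.e. 3000 training steps)."""
    best = float("inf")
    since_best = 0
    for loss in eval_losses:
        if loss < best:
            best = loss
            since_best = 0          # improvement resets the counter
        else:
            since_best += 1
            if since_best >= patience:
                return True         # 3 evals without improvement
    return False
```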

Evaluation results

Before vs after finetuning

Metric            Before training   After training
CER               721.54            1.64
WER               524.51            6.17
Perfect matches   0 / 100           78 / 100
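CER and WER are conventionally computed as Levenshtein edit distance normalized by reference length (here expressed as a percentage), which is why the before-training CER can exceed 100: an untrained decoder emitting output much longer than the reference accumulates more edits than the reference has characters. A hedged sketch of the standard metric, assuming that convention (the evaluation script's exact normalization is not stated in this card):

```python
def levenshtein(a, b):
    """Edit distance between two sequences (rolling-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate as a percentage of reference length."""
    return 100 * levenshtein(ref, hyp) / len(ref)

def wer(ref, hyp):
    """Word error rate as a percentage of reference word count."""
    return 100 * levenshtein(ref.split(), hyp.split()) / len(ref.split())
```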

JSON summary

{
  "before_training": {
    "cer": 721.5432960893855,
    "wer": 524.512987012987,
    "perfect_matches": 0,
    "total_samples": 100
  },
  "after_training": {
    "cer": 1.6410614525139664,
    "wer": 6.1688311688311686,
    "perfect_matches": 78,
    "total_samples": 100
  },
  "early_stopping": {
    "patience": 3,
    "eval_steps": 1000,
    "patience_in_steps": 3000
  }
}

GGUF weights

This model uses the standard multimodal GGUF layout:
LLM weights + mmproj weights (two files per quantization).

LLM weights

LightOnOCR-2-1B-sam-44-mss-alb-f16.gguf   (1.19 GB)
LightOnOCR-2-1B-sam-44-mss-alb-f32.gguf   (2.39 GB)
LightOnOCR-2-1B-sam-44-mss-alb-q8_0.gguf  (0.63 GB)

mmproj weights

mmproj-LightOnOCR-2-1B-sam-44-mss-alb-f16.gguf   (0.82 GB)
mmproj-LightOnOCR-2-1B-sam-44-mss-alb-f32.gguf   (1.63 GB)
mmproj-LightOnOCR-2-1B-sam-44-mss-alb-q8_0.gguf  (0.45 GB)

Note:
Multimodal GGUF models must remain split into LLM + mmproj.
Merging is not supported by GGUF or llama.cpp.


Ollama Modelfile example

FROM ./LightOnOCR-2-1B-sam-44-mss-alb-q8_0.gguf
FROM ./mmproj-LightOnOCR-2-1B-sam-44-mss-alb-f32.gguf

TEMPLATE {{ .Prompt }}
RENDERER qwen3-vl-instruct
PARSER qwen3-vl-instruct

PARAMETER top_p 0.9
PARAMETER temperature 0.2
PARAMETER num_ctx 16384
PARAMETER num_predict 4096

Usage (Python)

from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Load the processor and finetuned model from the Hugging Face Hub
processor = AutoProcessor.from_pretrained("samaritan-ai/LightOnOCR-2-1B-sam-44-mss-alb")
model = AutoModelForVision2Seq.from_pretrained("samaritan-ai/LightOnOCR-2-1B-sam-44-mss-alb")

# Open a manuscript page and preprocess it (RGB avoids mode mismatches)
img = Image.open("page.png").convert("RGB")
inputs = processor(images=img, return_tensors="pt")

# Generate and decode the OCR transcription
out = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(out[0], skip_special_tokens=True))

License

  • Base model: same license as lightonai/LightOnOCR-2-1B-base
  • Dataset: see samaritan-ai/sam_44_mss_albumentations
  • This finetuned model: same license as base unless otherwise specified

Acknowledgements

  • LightOnAI for the base model
  • Unsloth for efficient finetuning
  • FineVision for augmentation tools
  • The Samaritan manuscript preservation community