# LightOnOCR-2-1B-sam-44-mss-alb

Finetuned OCR model for medieval Samaritan Hebrew and Samaritan Aramaic manuscripts.

## Overview
LightOnOCR-2-1B-sam-44-mss-alb is a specialized OCR model finetuned from lightonai/LightOnOCR-2-1B-base.
It is trained on 44 medieval Samaritan Hebrew and Samaritan Aramaic manuscripts, using mild FineVision albumentations to improve robustness to real-world degradation such as ink fading, page warping, and historical noise.
This model is designed for high-accuracy OCR on historical Samaritan scripts and performs well on both clean and degraded manuscript images.
## Base model

- Base: `lightonai/LightOnOCR-2-1B-base`
- Architecture: Vision–Language OCR (encoder → mmproj → decoder)
- Context length: 16,384 tokens
- Training precision: bf16
- Image preprocessing: `longest_edge = 1540`
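The `longest_edge` rule can be sketched as follows. This is a minimal illustration of aspect-preserving resizing, assuming the processor scales images so their longest side equals 1540 px; the exact rounding (and any snapping to patch multiples) in the real preprocessor is not specified here.

```python
# Sketch: scale an image's dimensions so max(width, height) == longest_edge,
# preserving aspect ratio. Rounding behavior is an assumption.
def longest_edge_resize(width: int, height: int, longest_edge: int = 1540) -> tuple[int, int]:
    """Return (width, height) after scaling so the longest side equals longest_edge."""
    scale = longest_edge / max(width, height)
    return round(width * scale), round(height * scale)

# A 3080x2200 scan is halved to 1540x1100; a 770x1000 crop is upscaled.
print(longest_edge_resize(3080, 2200))
print(longest_edge_resize(770, 1000))
```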
## Training dataset

This model was trained on `samaritan-ai/sam_44_mss_albumentations`.

### Dataset characteristics
- 44 medieval manuscripts
- Languages:
- Samaritan Hebrew
- Samaritan Aramaic
- FineVision albumentations:
- mild geometric distortions
- blur, noise, sharpening
- contrast/brightness variation
- realistic manuscript degradation
- Aligned SAM-based segmentation
- 275k+ training samples
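The photometric side of the augmentations above can be illustrated with a small self-contained sketch. This is not the actual FineVision/albumentations pipeline; the transform choices and parameter ranges are assumptions meant to convey what "mild" degradation looks like on a grayscale page image.

```python
import numpy as np

rng = np.random.default_rng(0)

def mild_augment(img: np.ndarray) -> np.ndarray:
    """Apply small contrast/brightness jitter and Gaussian noise to a uint8 image.

    Illustrative only; ranges here are assumptions, not the dataset's settings.
    """
    x = img.astype(np.float32)
    alpha = rng.uniform(0.9, 1.1)           # contrast jitter
    beta = rng.uniform(-10.0, 10.0)         # brightness shift
    noise = rng.normal(0.0, 3.0, x.shape)   # scan/sensor noise
    x = alpha * (x - 128.0) + 128.0 + beta + noise
    return np.clip(x, 0, 255).astype(np.uint8)

page = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
aug = mild_augment(page)
print(aug.shape, aug.dtype)
```

A real pipeline would add the geometric distortions and blur/sharpening listed above on top of this.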
## Training details

- Frameworks:
  - Unsloth 2026.1.4
  - Transformers 5.1.0
  - PEFT + LoRA
- Precision: bf16
- Image preprocessing: `longest_edge = 1540`
- Early stopping: `patience: 3`, `eval_steps: 1000`, `patience_in_steps: 3000`
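How these three settings interact can be made concrete with a tiny sketch: evaluation runs every `eval_steps` optimizer steps, and training halts after `patience` consecutive evaluations without improvement, so the worst-case overrun is `patience * eval_steps = 3000` steps. The helper below is illustrative, not the trainer's actual implementation.

```python
def should_stop(eval_losses: list[float], patience: int = 3) -> bool:
    """True once the best eval loss has not improved for `patience` consecutive evals."""
    best = float("inf")
    since_best = 0
    for loss in eval_losses:
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
        if since_best >= patience:
            return True
    return False

# Three flat evals in a row trigger a stop; with eval_steps=1000 that is
# 3 * 1000 = 3000 optimizer steps, matching patience_in_steps above.
print(should_stop([0.9, 0.8, 0.81, 0.82, 0.83]))
```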
## Evaluation results

### Before vs after finetuning
| Metric | Before training | After training |
|---|---|---|
| CER | 721.54 | 1.64 |
| WER | 524.51 | 6.17 |
| Perfect matches | 0 / 100 | 78 / 100 |
### JSON summary

```json
{
  "before_training": {
    "cer": 721.5432960893855,
    "wer": 524.512987012987,
    "perfect_matches": 0,
    "total_samples": 100
  },
  "after_training": {
    "cer": 1.6410614525139664,
    "wer": 6.1688311688311686,
    "perfect_matches": 78,
    "total_samples": 100
  },
  "early_stopping": {
    "patience": 3,
    "eval_steps": 1000,
    "patience_in_steps": 3000
  }
}
```
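For reference, CER and WER figures like the ones above are conventionally computed as Levenshtein edit distance divided by reference length, scaled to percent (which is why a model that hallucinates long outputs can score far above 100 before finetuning). The scorer used for this card is not specified, so the sketch below is illustrative.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent."""
    return 100.0 * edit_distance(list(ref), list(hyp)) / max(len(ref), 1)

def wer(ref: str, hyp: str) -> float:
    """Word error rate in percent."""
    return 100.0 * edit_distance(ref.split(), hyp.split()) / max(len(ref.split()), 1)

print(cer("abcd", "abed"))  # one substitution over four characters -> 25.0
```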
## GGUF weights

This model uses the standard multimodal GGUF layout: LLM weights plus mmproj weights (two files per quantization).

### LLM weights

- `LightOnOCR-2-1B-sam-44-mss-alb-f16.gguf` (1.19 GB)
- `LightOnOCR-2-1B-sam-44-mss-alb-f32.gguf` (2.39 GB)
- `LightOnOCR-2-1B-sam-44-mss-alb-q8_0.gguf` (0.63 GB)

### mmproj weights

- `mmproj-LightOnOCR-2-1B-sam-44-mss-alb-f16.gguf` (0.82 GB)
- `mmproj-LightOnOCR-2-1B-sam-44-mss-alb-f32.gguf` (1.63 GB)
- `mmproj-LightOnOCR-2-1B-sam-44-mss-alb-q8_0.gguf` (0.45 GB)

> **Note:** Multimodal GGUF models must remain split into LLM and mmproj files; merging them into a single file is not supported by GGUF or llama.cpp.
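With llama.cpp, the two files are passed separately at load time. A hypothetical invocation is sketched below; the exact binary name and flags depend on your llama.cpp build, so treat this as a starting point rather than a guaranteed command line.

```shell
# Assumed llama.cpp multimodal CLI; check your build for the exact binary/flags.
llama-mtmd-cli \
  -m LightOnOCR-2-1B-sam-44-mss-alb-q8_0.gguf \
  --mmproj mmproj-LightOnOCR-2-1B-sam-44-mss-alb-f16.gguf \
  --image page.png \
  -p "Transcribe the manuscript page."
```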
## Ollama Modelfile example

```
FROM ./LightOnOCR-2-1B-sam-44-mss-alb-q8_0.gguf
FROM ./mmproj-LightOnOCR-2-1B-sam-44-mss-alb-f32.gguf
TEMPLATE {{ .Prompt }}
RENDERER qwen3-vl-instruct
PARSER qwen3-vl-instruct
PARAMETER top_p 0.9
PARAMETER temperature 0.2
PARAMETER num_ctx 16384
PARAMETER num_predict 4096
```
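Once the Modelfile is saved alongside the GGUF files, the model can be registered and run with Ollama. The model name `samaritan-ocr` below is arbitrary.

```shell
# Register the model from the Modelfile, then start an interactive session.
ollama create samaritan-ocr -f Modelfile
ollama run samaritan-ocr
```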
## Usage (Python)

```python
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

processor = AutoProcessor.from_pretrained("samaritan-ai/LightOnOCR-2-1B-sam-44-mss-alb")
model = AutoModelForVision2Seq.from_pretrained("samaritan-ai/LightOnOCR-2-1B-sam-44-mss-alb")

img = Image.open("page.png")
inputs = processor(images=img, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(out[0], skip_special_tokens=True))
```
## License

- Base model: same license as `lightonai/LightOnOCR-2-1B-base`
- Dataset: see `samaritan-ai/sam_44_mss_albumentations`
- This finetuned model: same license as the base unless otherwise specified

## Acknowledgements
- LightOnAI for the base model
- Unsloth for efficient finetuning
- FineVision for augmentation tools
- The Samaritan manuscript preservation community