LightOnOCR-2-1B for Samaritan Hebrew/Aramaic

This model is a fine-tuned version of lightonai/LightOnOCR-2-1B-base specifically trained for page-level OCR of Samaritan manuscripts.

Model Description

Base Model: lightonai/LightOnOCR-2-1B-base
Training Data: samaritan-ai/samaritan_hebrew_LightOnOcr
Task: Page-level text transcription from manuscript images
Language: Samaritan Hebrew/Aramaic (smp/sam)
Architecture: Vision-Language Model (1B parameters)

This is a page-level model: it accepts full pages, paragraphs, or crops of lines.

Evaluation Results

Test Set Performance

Metric	Base Model	Fine-tuned Model	Improvement (absolute, relative)
CER (Character Error Rate)	475.89%	7.68%	468.22 pp (98.4%)
WER (Word Error Rate)	341.22%	15.37%	325.85 pp (95.5%)
Perfect Matches	0/50 (0.00%)	37/50 (74.00%)	+74.00 pp
Character Accuracy	382.84%	59.31%	-323.53%
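
The card does not say which evaluation script produced these numbers. As a minimal sketch, assuming the usual definitions (edit distance divided by reference length), CER and WER could be computed as below. The function names are illustrative. Because insertions count as errors, a model that emits far more text than the reference can score above 100%, which is one plausible reading of the base-model figures.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Multiply the returned fraction by 100 to compare with the percentages in the table.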

Model Details

Base Model: lightonai/LightOnOCR-2-1B-base
Fine-tuned Model: LightOnOcr-2_samaritan
Test Samples: 50
Evaluation Date: 2026-01-23 14:19:32

Usage

Installation

# Requires transformers from source
pip install git+https://github.com/huggingface/transformers
pip install pillow torch

Python Usage

import torch
from transformers import LightOnOcrForConditionalGeneration, LightOnOcrProcessor
from PIL import Image

# Load model and processor
model_id = "johnlockejrr/LightOnOCR-2-1B-base-samaritan"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

processor = LightOnOcrProcessor.from_pretrained(model_id)
model = LightOnOcrForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=dtype,
).to(device)

# Load your page image (a full page, a paragraph, or a crop of lines)
image = Image.open("your_page_image.jpg").convert("RGB")

# Prepare input
messages = [{"role": "user", "content": [{"type": "image"}]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = processor(
    text=[text],
    images=[[image]],
    return_tensors="pt",
    padding=True,
    size={"longest_edge": 1024},
).to(device)
inputs["pixel_values"] = inputs["pixel_values"].to(dtype)

# Generate transcription
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=False)

# Decode output
input_length = inputs["input_ids"].shape[1]
generated_ids = outputs[0, input_length:]
transcription = processor.decode(generated_ids, skip_special_tokens=True)

print(transcription)

Batch Inference

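from datasets import load_dataset

# Load a few samples from the training dataset
# (reuses processor, model, device, and dtype from the previous example)
dataset = load_dataset("samaritan-ai/samaritan_hebrew_LightOnOcr", split="train[:10]")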

# Process batch
images = [[img.convert("RGB")] for img in dataset["image"]]
messages = [{"role": "user", "content": [{"type": "image"}]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
texts = [text] * len(images)

inputs = processor(
    text=texts,
    images=images,
    return_tensors="pt",
    padding=True,
    size={"longest_edge": 1024},
).to(device)
inputs["pixel_values"] = inputs["pixel_values"].to(dtype)

outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=False)
predictions = processor.batch_decode(outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

for pred, gt in zip(predictions, dataset["text"]):
    print(f"Prediction: {pred}")
    print(f"Ground Truth: {gt}")
    print()
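
The batch example above sends all selected images through the model at once. For larger sets it may be safer to process them in small chunks so memory use stays bounded. The helper below is a hypothetical sketch, not part of the model's API; `iter_batches` and the batch size of 4 in the usage comment are arbitrary choices.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Usage with the objects from the example above (not executed here):
# for batch in iter_batches(images, 4):
#     inputs = processor(text=[text] * len(batch), images=batch,
#                        return_tensors="pt", padding=True,
#                        size={"longest_edge": 1024}).to(device)
#     ...
```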

Using with Ollama

Create the model from a Modelfile:

$ ollama create lightonocr-2-base-samaritan -f Modelfile
gathering model components
copying file sha256:a0fefbcf2062da1905fc30ef31cf7341ed743fdb4613cdb5d595aa6dfef5e922 100%
copying file sha256:cdaa68fe9a5f0f54ee2be6e0bb819ab92878ba09a2ee3b204f02766e40a98e58 100%
parsing GGUF
using existing layer sha256:cdaa68fe9a5f0f54ee2be6e0bb819ab92878ba09a2ee3b204f02766e40a98e58
using existing layer sha256:a0fefbcf2062da1905fc30ef31cf7341ed743fdb4613cdb5d595aa6dfef5e922
creating new layer sha256:b507b9c2f6ca642bffcd06665ea7c91f235fd32daeefdf875a0f938db05fb315
creating new layer sha256:15b86ef1f86950774cb27b0bcdb123dd1b4267b6475fdc58f00635fc49a67bca
writing manifest
success

Run the model on an image:

$ ollama run lightonocr-2-base-samaritan "Transcribe this image" ./882241214-0327.jpg
Added image './882241214-0327.jpg'
ובבא משה אל אהל מועד לדבר אתו וישמע את
הקול מדבר אליו מעל הכפרת אשר על ארון
העדות מבין שני הכרובים וידבר אליו
וידבר יהוה אל משה לאמר דבר אל אהרן
ואמרת אליו בעלותך את הנרות אל מול פני
המנורה יאירו שבעת הנרות ויעש כן אהרן
אל מול פני המנורה העלה נירתיה כאשר
צוה יהוה את משה וזה מעשה המנורה מקשה
זהב עד ירכיה ועד פרחיה מקשה היא כמראה
אשר הראה יהוה את משה כן עשה את המנורה
וידבר יהוה אל משה לאמר קח את הלוים
מתוך בני ישראל וטהרת אתם וכה תעשה להם
לטהורם הזה עליהם מי חטאת והעבירו תער
על כל בשרם וכבסו בגדיהם והתעורו ולקחו
פר בן בקר ומנחתו סלת בלולה בשמן
ופר שני בן בקר תקח לחטאת והקרבת את
הלוים לפני אהל מועד והקהלת את
כל עדת בני ישראל והקרבת את
הלוים לפני יהוה וסמכו בני ישראל
את ידיהם על הלוים והניף אהרן
ישראל והיו לעבד לפני יהוה מאת בני
יסמכו את ידיהם על ראש הפרים ועשה את
האחד חטאת ואת האחד עלה ליהוה לכפר
על הלוים והעמדת את הלוים לפני אהרן
ולפני בניו והנפת אתם תנופה ליהוה
והבדלת את הלוים מתוך בני ישראל
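
The same model can also be queried from a script through Ollama's local HTTP API instead of the CLI. The sketch below assumes an Ollama server on the default address `http://localhost:11434` and the model name created above. `build_payload` and `transcribe` are illustrative helper names.

```python
import base64
import json
import urllib.request


def build_payload(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        # Ollama expects images as base64-encoded strings
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # ask for one complete JSON response
    }


def transcribe(image_path: str,
               model: str = "lightonocr-2-base-samaritan",
               host: str = "http://localhost:11434") -> str:
    """Send one image to a local Ollama server and return the generated text."""
    with open(image_path, "rb") as f:
        payload = build_payload(model, "Transcribe this image", f.read())
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))["response"]


# Usage (requires a running Ollama server):
# print(transcribe("./882241214-0327.jpg"))
```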

Training Details

Base Model: lightonai/LightOnOCR-2-1B-base
Training Method: Fine-tuning with frozen language model backbone
Optimizer: AdamW (fused)
Learning Rate: 6e-5 with linear decay
Precision: bfloat16
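
The training script is not included in this card. As a rough sketch of two of the settings listed above, the code below shows a linear learning-rate decay from 6e-5 to zero and a helper that freezes parameters by name prefix. Several things are assumptions: that there is no warmup, that decay ends at zero, and the prefix `language_model` used in the usage comment. The helper works on any iterable of `(name, parameter)` pairs, such as the output of a PyTorch model's `named_parameters()`.

```python
def linear_decay_lr(step: int, total_steps: int, base_lr: float = 6e-5) -> float:
    """Learning rate that falls linearly from base_lr at step 0 to 0 at total_steps."""
    if total_steps < 1:
        raise ValueError("total_steps must be at least 1")
    remaining = max(0.0, 1.0 - step / total_steps)
    return base_lr * remaining


def freeze_by_prefix(named_parameters, prefixes):
    """Set requires_grad=False on every parameter whose name starts with a prefix.

    Returns the list of frozen parameter names so the result can be inspected.
    """
    prefixes = tuple(prefixes)
    frozen = []
    for name, param in named_parameters:
        if name.startswith(prefixes):
            param.requires_grad = False
            frozen.append(name)
    return frozen


# Usage with a PyTorch model (the prefix is hypothetical; check the real names
# with `[n for n, _ in model.named_parameters()]` first):
# frozen = freeze_by_prefix(model.named_parameters(), ["language_model"])
```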

Limitations

This model is trained on page-level images (full pages, paragraphs, or crops of lines); prior line segmentation is not required.
Performance may vary on manuscript styles not represented in the training data.
Samaritan manuscripts may contain abbreviations and special characters that require domain-specific post-processing.
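
What post-processing is appropriate depends on the manuscripts and the downstream use. As one possible starting point, the sketch below normalizes Unicode, optionally strips combining marks, and cleans up whitespace while keeping line breaks. The function name and the choice of steps are assumptions, not part of the model.

```python
import unicodedata


def normalize_transcription(text: str, strip_marks: bool = False) -> str:
    """Normalize a page transcription line by line.

    - applies Unicode NFC normalization
    - optionally removes combining marks (for example vowel points)
    - collapses runs of whitespace and drops empty lines
    """
    text = unicodedata.normalize("NFC", text)
    if strip_marks:
        text = "".join(ch for ch in text if not unicodedata.combining(ch))
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```

Applying the same normalization to both predictions and ground truth before scoring keeps whitespace differences from inflating error rates.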

Citation

If you use this model, please cite:

@misc{lightonocr2_smp_2026,
  title = {LightOnOCR Fine-tuned for Samaritan Hebrew/Aramaic},
  author = {John Locke},
  year = {2026},
  howpublished = {\url{https://huggingface.co/johnlockejrr/LightOnOCR-2-1B-base-samaritan-pre}}
}

And the original LightOnOCR paper:

@misc{lightonocr2_2026,
  title = {LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR},
  author = {Said Taghadouini and Adrien Cavaill\`{e}s and Baptiste Aubertin},
  year = {2026},
  howpublished = {\url{https://arxiv.org/pdf/2601.14251}}
}

Acknowledgments

LightOn AI for the excellent LightOnOCR base model
