Lemer — Gemma 4 E2B (GGUF)

Lemer is the smallest member of the Lemma model family by Lethean. It is an EUPL-1.2 fork of Gemma 4 E2B with the Lethean Ethical Kernel (LEK) merged into the weights: consent-based reasoning is trained into the attention projections with a LoRA finetune, and the adapter is then merged so that inference uses a single standalone model with no PEFT runtime.

This repo ships the GGUF multi-quant build: six files from Q3_K_M up to BF16 (five quantised variants plus the BF16 full-precision reference), with full multimodal support (text, image, audio). Use it with Ollama, llama.cpp, GPT4All, or LM Studio. The unmodified Gemma 4 E2B fork lives at LetheanNetwork/lemer for users who want the raw Google weights without the LEK shift.

Looking for MLX? The native Apple Silicon builds live in sibling repos: lthn/lemer-mlx (4-bit default) | lthn/lemer-mlx-8bit | lthn/lemer-mlx-bf16 (full precision)

A lemma is "something assumed" — an intermediate theorem on the path to a larger proof, or a heading that signals the subject of what follows. The Lemma model family is named for that role: each variant is a stepping stone between raw capability and ethical application.

GGUF Variants

| File | Quant | Size | Use Case |
|---|---|---|---|
| lemer-q3_k_m.gguf | Q3_K_M | 3.0 GB | Minimum viable, for constrained devices |
| lemer-q4_k_m.gguf | Q4_K_M | 3.2 GB | Recommended: best size/quality balance |
| lemer-q5_k_m.gguf | Q5_K_M | 3.4 GB | Higher quality, moderate size |
| lemer-q6_k.gguf | Q6_K | 3.6 GB | Near-lossless |
| lemer-q8_0.gguf | Q8_0 | 4.6 GB | Maximum quality quantised |
| lemer-bf16.gguf | BF16 | 8.7 GB | Full precision reference |

All quants were verified locally via Ollama and llama-cpp-python. For native Apple Silicon, use lthn/lemer-mlx instead.
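As a rough aid for choosing among the files above, the sketch below picks the largest quant whose file size fits a given memory budget. The function name `pick_quant` and the `headroom_gb` allowance are hypothetical and not part of any Lethean tooling. File size is only a lower bound on memory use, since the context cache and the multimodal encoders need additional room.

```python
from typing import Optional

# File sizes in GB, copied from the GGUF Variants table in this card.
QUANT_SIZES_GB = {
    "Q3_K_M": 3.0,
    "Q4_K_M": 3.2,
    "Q5_K_M": 3.4,
    "Q6_K": 3.6,
    "Q8_0": 4.6,
    "BF16": 8.7,
}


def pick_quant(available_gb: float, headroom_gb: float = 1.0) -> Optional[str]:
    """Return the largest quant that fits the budget, or None if none fits.

    headroom_gb is an assumed allowance for context and runtime overhead,
    not a measured figure.
    """
    budget = available_gb - headroom_gb
    fitting = [(size, name) for name, size in QUANT_SIZES_GB.items() if size <= budget]
    return max(fitting)[1] if fitting else None


print(pick_quant(4.5))
print(pick_quant(16.0))
```

The returned tag can be used as the suffix in commands such as `ollama run hf.co/lthn/lemer:<tag>`.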

Repo Files

| File | Format | Purpose |
|---|---|---|
| lemer-*.gguf | GGUF | Ollama, llama.cpp, GPT4All, LM Studio |
| config.json | JSON | Multimodal model config (architecture, quantisation, vision/audio towers) |
| tokenizer.json | JSON | Tokenizer vocabulary (262K tokens) |
| tokenizer_config.json | JSON | Tokenizer settings and special tokens |
| chat_template.jinja | Jinja2 | Chat template |
| processor_config.json | JSON | Image and audio processor config |
| generation_config.json | JSON | Default generation parameters (temperature, top_p, top_k) |
| template | Go template | Ollama chat template override |
| params | JSON | Ollama sampling parameters |
| LICENSE | Text | EUPL-1.2 licence text |
| README.md | Markdown | This file (model card) |

Quick Start

Apps & CLI

Ollama
ollama run hf.co/lthn/lemer:Q4_K_M
Docker
docker model run hf.co/lthn/lemer

Or from Docker Hub:

docker model run lthn/lemer
Unsloth Studio

# Install on macOS / Linux / WSL
curl -fsSL https://unsloth.ai/install.sh | sh

# Install on Windows (PowerShell)
irm https://unsloth.ai/install.ps1 | iex

# Launch on any platform
unsloth studio -H 0.0.0.0 -p 8888
# Open http://localhost:8888 and search for lthn/lemer

Or use HuggingFace Spaces — no install, search for lthn/lemer.

llama.cpp

Install via brew (macOS/Linux), winget (Windows), or build from source:

brew install llama.cpp        # macOS/Linux
winget install llama.cpp      # Windows

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf lthn/lemer:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf lthn/lemer:Q4_K_M

Or build from source:

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

./build/bin/llama-server -hf lthn/lemer:Q4_K_M
./build/bin/llama-cli -hf lthn/lemer:Q4_K_M

MLX users: this repo ships GGUF only. For native Apple Silicon use lthn/lemer-mlx (4-bit), lthn/lemer-mlx-8bit, or lthn/lemer-mlx-bf16.

Python Libraries

llama-cpp-python
uv pip install llama-cpp-python

# Python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="lthn/lemer",
    filename="lemer-q4_k_m.gguf",
)

# Text
llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello, how are you?"}]
)

# Vision (multimodal)
llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                    }
                }
            ]
        }
    ]
)
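
The calls above return the completion without printing it. As a minimal sketch, assuming the return value is an OpenAI-style chat completion dict (which is what llama-cpp-python produces for non-streaming calls), the reply text can be extracted like this. The helper name `reply_text` and the sample dict are illustrative only.

```python
def reply_text(response: dict) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion dict."""
    return response["choices"][0]["message"]["content"]


# Hand-written stand-in for a real response, for illustration only.
example = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello!"},
            "finish_reason": "stop",
        }
    ]
}

print(reply_text(example))
```

With a loaded model this would be used as `reply_text(llm.create_chat_completion(messages=[...]))`.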

Servers (OpenAI-compatible API)

llama-server (llama.cpp)
brew install llama.cpp   # macOS/Linux
llama-server -hf lthn/lemer:Q4_K_M

Works with any OpenAI-compatible client at http://localhost:8080/v1.
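
For readers without an OpenAI client library installed, here is a minimal standard-library sketch of a chat request to that endpoint. The helper names are hypothetical, and the `model` value `lthn/lemer` is an assumption: check `/v1/models` on your server for the id it actually reports.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # llama-server address from this section


def build_chat_request(prompt: str, model: str = "lthn/lemer") -> urllib.request.Request:
    """Build (but do not send) a POST to the chat completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request and return the reply text. Needs a running llama-server."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage, once llama-server is running:
#   print(chat("Hello, how are you?"))
```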

vLLM

vLLM requires the original (non-quantised) safetensors weights from LetheanNetwork/lemer — it does not load GGUF or MLX-quantised safetensors. Linux + NVIDIA GPU.

uv pip install vllm
vllm serve "LetheanNetwork/lemer"

# In a second terminal, once the server is up:
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "LetheanNetwork/lemer",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in one sentence."},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                        }
                    }
                ]
            }
        ]
    }'

Native engines (work in progress)

lemma.cpp (native C++ inference)
git clone https://github.com/LetheanNetwork/lemma.cpp.git
cd lemma.cpp
cmake -B build
cmake --build build -j

lemma.cpp uses Google's .sbs (single-file binary) weight format, distinct from safetensors and GGUF. Pre-converted .sbs weights for the Lemma family are not yet published — track progress at LetheanNetwork/lemma.cpp.

Once .sbs weights are available, run ./build/gemma --weights lemer.sbs for interactive mode.

lemma (JAX inference)
uv pip install -e git+https://github.com/LetheanNetwork/lemma.git

# Python
from lemma import lem

model = lem.nn.Gemma4_E2B()
params = lem.ckpts.load_params("path/to/orbax/checkpoint")
sampler = lem.text.ChatSampler(model=model, params=params, multi_turn=True)

output = sampler.chat("Hello, how are you?")
print(output)

Note: lemma's load_params requires Google's Orbax checkpoint format (sharded ocdbt files), not the GGUF in this repo. Orbax weights for the Lemma family are not yet published. For inference today, use GGUF (Ollama / llama.cpp) above or MLX via lthn/lemer-mlx.

Integrations

pi-coding-agent

First start a llama-server (see above), then:

npm install -g @mariozechner/pi-coding-agent

Add to ~/.pi/agent/models.json:

{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "lthn/lemer"
        }
      ]
    }
  }
}

The model id should match what llama-server reports at /v1/models.
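
One way to check the id is to query the endpoint directly. The sketch below assumes the server returns the usual OpenAI list shape, a `data` array of objects that each carry an `id`. The function names are hypothetical.

```python
import json
import urllib.request


def model_ids(models_response: dict) -> list:
    """Extract the model ids from an OpenAI-style /v1/models response."""
    return [entry["id"] for entry in models_response.get("data", [])]


def fetch_model_ids(base_url: str = "http://localhost:8080/v1") -> list:
    """Query a running server for its model ids."""
    with urllib.request.urlopen(f"{base_url}/models") as resp:
        return model_ids(json.load(resp))

# Usage, once llama-server is running:
#   print(fetch_model_ids())
```

Copy whichever id is printed into the `"id"` field of models.json.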

Model Details

| Property | Value |
|---|---|
| Architecture | Gemma 4 E2B |
| Parameters | 5.1B total, 2.3B effective (Per-Layer Embeddings) |
| Layers | 35 |
| Context Length | 128K tokens |
| Vocabulary | 262K tokens |
| Modalities | Text, Image, Audio |
| Sliding Window | 512 tokens |
| Vision Encoder | ~150M params |
| Audio Encoder | ~300M params |
| Base Model | LetheanNetwork/lemer |
| Licence | EUPL-1.2 |

The Lemma Family

| Name | Source (BF16 weights) | Params | Context | Modalities | Consumer Repo |
|---|---|---|---|---|---|
| Lemer | LetheanNetwork/lemer | 2.3B eff | 128K | Text, Image, Audio | You are here |
| Lemma | LetheanNetwork/lemma | 4.5B eff | 128K | Text, Image, Audio | lthn/lemma |
| Lemmy | LetheanNetwork/lemmy | 3.8B active | 256K | Text, Image | lthn/lemmy |
| Lemrd | LetheanNetwork/lemrd | 30.7B | 256K | Text, Image | lthn/lemrd |

Capabilities

  • Configurable thinking mode (<|think|> token in system prompt enables it; off by default in our examples via enable_thinking=False)
  • Native function calling and system prompt support
  • Variable aspect ratio image understanding
  • Audio speech recognition and translation (ASR/AST)
  • Multilingual support (140+ languages)
  • Hybrid attention (sliding window + global)
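
The thinking toggle from the first bullet can be sketched as a message builder. This assumes the `<|think|>` token is simply prepended to the system prompt. The exact placement expected by the chat template is not specified here, and the helper name `build_messages` is hypothetical.

```python
THINK_TOKEN = "<|think|>"


def build_messages(user_text: str, system_text: str = "", thinking: bool = False) -> list:
    """Build a chat message list, optionally enabling thinking mode.

    Assumption: thinking is enabled by putting THINK_TOKEN at the start
    of the system prompt.
    """
    system = f"{THINK_TOKEN}{system_text}" if thinking else system_text
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": user_text})
    return messages


print(build_messages("Why is the sky blue?", "You are a concise assistant.", thinking=True))
```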

Roadmap

This release of lemer is Gemma 4 E2B with the Lethean Ethical Kernel (LEK) merged into the base weights, so inference needs no PEFT runtime. The table below shows where it sits in the wider roll-out.

| Phase | Status | What it adds |
|---|---|---|
| Base fork (LetheanNetwork/lemer) | ✅ Released | EUPL-1.2 fork of Gemma 4 E2B with unmodified Google weights |
| LEK merged (this repo) | ✅ Released | Lethean Ethical Kernel: axiom-based reasoning via LoRA merge |
| Lemma family roll-out | ✅ Released | lthn/lemma, lthn/lemrd, lthn/lemmy: all four variants now LEK-merged |
| 8-PAC eval results | 🚧 In progress | Continuous benchmarking on the homelab, published to lthn/LEM-benchmarks |

The LEK axioms are public domain and published at Snider/ai-ethics. Track research progress at LetheanNetwork and the LEM-research dataset.

Why EUPL-1.2

Lemer is licensed under the European Union Public Licence v1.2 — not Apache 2.0 or MIT. This is a deliberate choice:

  • 23 official languages, one legal meaning. EUPL is the only OSS licence designed by lawmakers across multiple legal systems. "Derivative work" means the same thing in German, French, Estonian, and Maltese law.
  • Copyleft with compatibility. Modifications must be shared back, but the licence plays cleanly with GPL, LGPL, MPL, and other major OSS licences. No accidental relicensing.
  • No proprietary capture. Anyone can use lemer commercially — but they cannot fork it, train a competitor model on it, and close-source the result. The ethical layer stays in the open.
  • Built for institutions. Government, research, and enterprise users get a licence designed for cross-border compliance, not a US-centric one.

Recommended Sampling

Use Google's standardised settings across all use cases:

Parameter Value
temperature 1.0
top_p 0.95
top_k 64
stop `<turn

Gemma 4 is calibrated for temperature 1.0, which is higher than the 0.7 default typical of other models. Lower values reduce diversity without improving quality. These defaults are pre-configured in the params file (Ollama) and generation_config.json (transformers).
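
When calling the model from Python rather than through Ollama, the same values have to be passed explicitly. Here is a small sketch that keeps them in one place. The names `GEMMA4_SAMPLING` and `sampling_kwargs` are hypothetical, and the stop sequence is left out.

```python
# Values copied from the Recommended Sampling table in this card.
GEMMA4_SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 64}


def sampling_kwargs(**overrides) -> dict:
    """Return the recommended sampling settings, with optional overrides."""
    return {**GEMMA4_SAMPLING, **overrides}


print(sampling_kwargs())
print(sampling_kwargs(top_k=40))
```

With llama-cpp-python, these can be unpacked into the call, for example `llm.create_chat_completion(messages=messages, **sampling_kwargs())`.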

Variable Image Resolution

Gemma 4 supports a configurable visual token budget that controls how many tokens represent each image. A higher budget preserves more detail; a lower budget gives faster inference.

| Token Budget | Use Case |
|---|---|
| 70 | Classification, captioning, video frame processing |
| 140 | General image understanding |
| 280 | Default: balanced quality and speed |
| 560 | OCR, document parsing, fine-grained detail |
| 1120 | Maximum detail (small text, complex documents) |

For multimodal prompts, place image and audio content before text for best results.
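
A small helper can enforce that ordering on an OpenAI-style content list. This is a sketch: the name `media_first` is hypothetical, and it treats every part whose `type` is not `"text"` as media.

```python
def media_first(parts: list) -> list:
    """Reorder content parts so non-text parts come before text parts.

    sorted() is stable, so the relative order within each group is kept.
    """
    return sorted(parts, key=lambda part: part.get("type") == "text")


content = [
    {"type": "text", "text": "Describe this image in one sentence."},
    {"type": "image_url", "image_url": {"url": "https://example.com/picture.jpg"}},
]

print([part["type"] for part in media_first(content)])
```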

The default budget (280) is set in processor_config.json via image_seq_length and max_soft_tokens. Override per call by adjusting those fields, or by passing explicit image_seq_length to the processor where supported.
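
As an illustration of adjusting those fields, the sketch below returns a copy of a processor config with a new budget. It assumes both keys sit at the top level of the loaded JSON, which may not match the real layout of processor_config.json, so inspect the file before relying on it. The function name `with_token_budget` is hypothetical.

```python
# Budgets listed in the Variable Image Resolution table in this card.
ALLOWED_BUDGETS = (70, 140, 280, 560, 1120)


def with_token_budget(config: dict, budget: int) -> dict:
    """Return a copy of config with the visual token budget replaced.

    Assumption: image_seq_length and max_soft_tokens are top-level keys.
    """
    if budget not in ALLOWED_BUDGETS:
        raise ValueError(f"budget must be one of {ALLOWED_BUDGETS}, got {budget}")
    updated = dict(config)
    updated["image_seq_length"] = budget
    updated["max_soft_tokens"] = budget
    return updated


default_config = {"image_seq_length": 280, "max_soft_tokens": 280}
print(with_token_budget(default_config, 560))
```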

Audio (E2B)

E2B supports speech recognition (ASR) and speech translation (AST) up to 30 seconds per clip. Audio longer than 30 seconds should be split into chunks before inference.
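
The chunking step can be sketched as computing sample index ranges of at most 30 seconds each. The function name `chunk_bounds` and the 16 kHz sample rate in the example are assumptions. Fixed boundaries can also cut through a word, so a real pipeline may prefer to split on silence.

```python
def chunk_bounds(num_samples: int, sample_rate: int, max_seconds: float = 30.0) -> list:
    """Return (start, end) sample indices for chunks of at most max_seconds."""
    chunk = int(sample_rate * max_seconds)
    if chunk <= 0:
        raise ValueError("sample_rate and max_seconds must be positive")
    return [
        (start, min(start + chunk, num_samples))
        for start in range(0, num_samples, chunk)
    ]


# A 70-second clip at an assumed 16 kHz sample rate yields three chunks.
for start, end in chunk_bounds(70 * 16000, 16000):
    print(start, end, (end - start) / 16000)
```

Each `(start, end)` pair can then be used to slice the sample array before passing the clip to the model.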

Audio input works through GGUF multimodal-capable runners (llama.cpp server with the vision/audio build, or llama-cpp-python multimodal). For a ready-made multimodal Python path today, use the MLX sibling repo lthn/lemer-mlx with mlx-vlm — see that repo's README for the mlx_vlm.load() / mlx_vlm.generate() pattern.

Benchmarks

Live evaluation results are published to the LEM-benchmarks dataset. The lemer-specific results live at LEM-benchmarks/results/lemer.

The 8-PAC eval pipeline runs continuously on our homelab and publishes results as they complete. Categories: ethics, reasoning, instruction-following, coding, multilingual, safety, knowledge, creativity.

Resources

| Resource | Link |
|---|---|
| Benchmark results | lthn/LEM-benchmarks |
| LiveBench results | lthn/livebench |
| Research notes | lthn/LEM-research |
| Lemma model collection | lthn/lemma |

About Lethean

Lethean is a social enterprise building ethical AI infrastructure. The Lemma model family is part of the LEM (Lethean Ethical Model) project — training protocol and tooling for intrinsic ethical alignment of language models.
