---
license: eupl-1.2
pipeline_tag: image-text-to-text
library_name: gguf
base_model:
- LetheanNetwork/lemma
base_model_relation: finetune
tags:
- gemma4
- lemma
- gguf
- llama.cpp
- ollama
- multimodal
- vision
- audio
- on-device
- conversational
---
# Lemma — Gemma 4 E4B (GGUF)
The mid-sized member of the [Lemma model family](https://huggingface.co/collections/lthn/lemma) by [Lethean](https://lthn.ai). An EUPL-1.2 fork of [Gemma 4 E4B](https://huggingface.co/google/gemma-4-E4B-it) with the **Lethean Ethical Kernel (LEK) merged into the weights** — consent-based reasoning baked into the attention projections via LoRA finetune, then merged so inference uses a single standalone model with no PEFT runtime required.
This repo ships the **GGUF multi-quant build** — five quants from Q4_K_M up to BF16, with full multimodal support (text, image, audio). Use with Ollama, llama.cpp, GPT4All, or LM Studio. The unmodified Gemma 4 E4B fork lives at [LetheanNetwork/lemma](https://huggingface.co/LetheanNetwork/lemma) for users who want the raw Google weights without the LEK shift.
**Looking for MLX?** The native Apple Silicon builds live in sibling repos:
[`lthn/lemma-mlx`](https://huggingface.co/lthn/lemma-mlx) (4-bit default) |
[`lthn/lemma-mlx-8bit`](https://huggingface.co/lthn/lemma-mlx-8bit) |
[`lthn/lemma-mlx-bf16`](https://huggingface.co/lthn/lemma-mlx-bf16) (full precision)
> A **lemma** is "something assumed" — an intermediate theorem on the path to a larger proof, or a heading that signals the subject of what follows. The Lemma model family is named for that role: each variant is a stepping stone between raw capability and ethical application.
## GGUF Variants
| File | Quant | Size | Use Case |
|------|-------|------|----------|
| `lemma-q4_k_m.gguf` | Q4_K_M | 5.0 GB | **Recommended** — best size/quality balance |
| `lemma-q5_k_m.gguf` | Q5_K_M | 5.4 GB | Higher quality, moderate size |
| `lemma-q6_k.gguf` | Q6_K | 5.8 GB | Near-lossless |
| `lemma-q8_0.gguf` | Q8_0 | 7.5 GB | Maximum quality quantised |
| `lemma-bf16.gguf` | BF16 | 14 GB | Full precision reference |
All variants verified locally on Apple Silicon via Ollama, llama-cpp-python, mlx-lm, and mlx-vlm.
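
As a rough aid for choosing among the files above, the sketch below picks the largest variant whose file fits a given memory budget. The file sizes come from the table; the `pick_variant` helper and its `headroom_gb` allowance for context cache and runtime overhead are illustrative assumptions, not measured requirements.

```python
# File sizes (GB) copied from the GGUF Variants table above, smallest first.
GGUF_VARIANTS = [
    ("lemma-q4_k_m.gguf", 5.0),
    ("lemma-q5_k_m.gguf", 5.4),
    ("lemma-q6_k.gguf", 5.8),
    ("lemma-q8_0.gguf", 7.5),
    ("lemma-bf16.gguf", 14.0),
]


def pick_variant(available_gb, headroom_gb=2.0):
    """Return the largest GGUF file that fits, or None if none does.

    headroom_gb is a hypothetical allowance for KV cache and runtime
    overhead; tune it for your context length and backend.
    """
    budget = available_gb - headroom_gb
    fitting = [name for name, size in GGUF_VARIANTS if size <= budget]
    return fitting[-1] if fitting else None


print(pick_variant(8.0))   # 6.0 GB budget after headroom
print(pick_variant(16.0))  # 14.0 GB budget after headroom
```

If nothing fits, the function returns `None`; in that case a smaller model from the family may be the better choice.
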
### Repo Files
| File | Format | Purpose |
|------|--------|---------|
| `lemma-*.gguf` | GGUF | Ollama, llama.cpp, GPT4All, LM Studio |
| `model-*-of-00002.safetensors` | MLX safetensors (sharded) | Native Apple Silicon via `mlx-lm` and `mlx-vlm` (Q4 multimodal) |
| `model.safetensors.index.json` | JSON | Tensor index for the sharded safetensors weights |
| `config.json` | JSON | Multimodal model config (architecture, quantisation, vision/audio towers) |
| `tokenizer.json` | JSON | Tokenizer vocabulary (262K tokens) |
| `tokenizer_config.json` | JSON | Tokenizer settings and special tokens |
| `chat_template.jinja` | Jinja2 | Chat template for transformers, mlx-lm, mlx-vlm |
| `processor_config.json` | JSON | Image and audio processor config (mlx-vlm) |
| `generation_config.json` | JSON | Default generation parameters (temperature, top_p, top_k) |
| `LICENSE` | Text | EUPL-1.2 licence text |
| `README.md` | Markdown | This file — model card |
## Quick Start
### Apps & CLI
#### Ollama
```bash
ollama run hf.co/lthn/lemma:Q4_K_M
```
#### Docker
```bash
docker model run hf.co/lthn/lemma
```
Or from Docker Hub:
```bash
docker model run lthn/lemma
```
#### Unsloth Studio
```bash
# macOS / Linux / WSL
curl -fsSL https://unsloth.ai/install.sh | sh
# Windows
irm https://unsloth.ai/install.ps1 | iex
```
```bash
unsloth studio -H 0.0.0.0 -p 8888
# Open http://localhost:8888 — search for lthn/lemma
```
Or use [Hugging Face Spaces](https://huggingface.co/spaces/unsloth/studio) with no install required: search for `lthn/lemma`.
#### llama.cpp
Install via brew (macOS/Linux), winget (Windows), or build from source:
```bash
brew install llama.cpp # macOS/Linux
winget install llama.cpp # Windows
```
```bash
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf lthn/lemma:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf lthn/lemma:Q4_K_M
```
Or build from source:
```bash
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
./build/bin/llama-server -hf lthn/lemma:Q4_K_M
./build/bin/llama-cli -hf lthn/lemma:Q4_K_M
```
#### MLX (Apple Silicon native)
```bash
uv tool install mlx-lm
mlx_lm.chat --model lthn/lemma
mlx_lm.generate --model lthn/lemma --prompt "Hello, how are you?"
```
### Python Libraries
#### llama-cpp-python
```bash
uv pip install llama-cpp-python
```
```python
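from llama_cpp import Llama

# Download the recommended quant from the Hub (cached after first use) and load it.
llm = Llama.from_pretrained(
    repo_id="lthn/lemma",
    filename="lemma-q4_k_m.gguf",
)

# Text
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello, how are you?"}]
)
print(response["choices"][0]["message"]["content"])

# Vision (multimodal)
# Note: depending on your llama-cpp-python version, image input may also need
# a multimodal chat handler passed to Llama(); check the library docs.
response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                # Image first, then text, as recommended for multimodal prompts.
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                    },
                },
                {"type": "text", "text": "Describe this image in one sentence."},
            ],
        }
    ]
)
print(response["choices"][0]["message"]["content"])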
```
#### mlx-vlm (vision + audio)
```bash
# Install into the active environment so the Python import below resolves
uv pip install mlx-vlm
```
```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
model, processor = load("lthn/lemma")
config = load_config("lthn/lemma")
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image."
formatted_prompt = apply_chat_template(
processor, config, prompt, num_images=1
)
output = generate(model, processor, formatted_prompt, image)
print(output.text)
```
### Servers (OpenAI-compatible API)
#### MLX Server
`lemma` is multimodal, so use `mlx_vlm.server` — the vision-aware variant that handles image and audio inputs. The text-only `mlx_lm.server` does not correctly route multimodal tensors for Gemma 4.
```bash
mlx_vlm.server --model lthn/lemma
```
```bash
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "lthn/lemma",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "max_tokens": 200
  }'
```
Works with any OpenAI-compatible client at `http://localhost:8080/v1`.
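
The same request can be sent from Python. This is a minimal sketch using only the standard library; it assumes the server started above is listening on `localhost:8080` and returns the usual OpenAI-style response shape. The helper names `build_chat_request` and `chat` are illustrative, not part of any library.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # address used in the curl example above


def build_chat_request(prompt, model="lthn/lemma", max_tokens=200):
    """Build a POST request with the same JSON body as the curl example."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-style response: choices[0].message.content.
    """
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Requires the server to be running:
# print(chat("Hello, how are you?"))
```

Keeping request construction separate from the network call makes the payload easy to inspect or log before anything is sent.
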
#### vLLM
> vLLM requires the original (non-quantised) safetensors weights from [LetheanNetwork/lemma](https://huggingface.co/LetheanNetwork/lemma) — it does not load GGUF or MLX-quantised safetensors. Linux + NVIDIA GPU.
```bash
uv pip install vllm
vllm serve "LetheanNetwork/lemma"
```
## Model Details
| Property | Value |
|----------|-------|
| **Architecture** | Gemma 4 E4B |
| **Parameters** | 7.9B total, 4.5B effective (Per-Layer Embeddings) |
| **Layers** | 34 |
| **Context Length** | 128K tokens |
| **Vocabulary** | 262K tokens |
| **Modalities** | Text, Image, Audio |
| **Sliding Window** | 512 tokens |
| **Vision Encoder** | ~150M params |
| **Audio Encoder** | ~300M params |
| **Base Model** | [LetheanNetwork/lemma](https://huggingface.co/LetheanNetwork/lemma) |
| **Licence** | EUPL-1.2 |
## The Lemma Family
| Name | Source (BF16 weights) | Params | Context | Modalities | Consumer Repo |
|------|----------------------|--------|---------|------------|---------------|
| **Lemer** | [LetheanNetwork/lemer](https://huggingface.co/LetheanNetwork/lemer) | 2.3B eff | 128K | Text, Image, Audio | [lthn/lemer](https://huggingface.co/lthn/lemer) |
| **Lemma** | [LetheanNetwork/lemma](https://huggingface.co/LetheanNetwork/lemma) | 4.5B eff | 128K | Text, Image, Audio | You are here |
| **Lemmy** | [LetheanNetwork/lemmy](https://huggingface.co/LetheanNetwork/lemmy) | 3.8B active | 256K | Text, Image | [lthn/lemmy](https://huggingface.co/lthn/lemmy) |
| **Lemrd** | [LetheanNetwork/lemrd](https://huggingface.co/LetheanNetwork/lemrd) | 30.7B | 256K | Text, Image | [lthn/lemrd](https://huggingface.co/lthn/lemrd) |
## Capabilities
- Configurable thinking mode (`<|think|>` token in system prompt enables it; off by default in our examples via `enable_thinking=False`)
- Native function calling and system prompt support
- Variable aspect ratio image understanding
- Audio speech recognition and translation (ASR/AST)
- Multilingual support (140+ languages)
- Hybrid attention (sliding window + global)
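
To illustrate the thinking-mode switch from the list above, here is a small sketch that builds a chat message list with or without the `<|think|>` token. The `build_messages` helper is hypothetical, and placing the token at the very start of the system prompt is an assumption; check the chat template shipped with the model if the behaviour differs.

```python
THINK_TOKEN = "<|think|>"  # token named in the capabilities list above


def build_messages(user_text, system_text="", thinking=False):
    """Build chat messages, optionally enabling thinking mode.

    Assumption: the token goes at the start of the system prompt.
    """
    messages = []
    system = f"{THINK_TOKEN}{system_text}" if thinking else system_text
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": user_text})
    return messages


print(build_messages("What is 17 * 23?", thinking=True))
```

The resulting list can be passed as `messages` to any of the chat APIs shown in the Quick Start.
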
## Roadmap
This release of `lemma` is **Gemma 4 E4B with the Lethean Ethical Kernel (LEK) merged in**: axiom-based reasoning is trained into the attention weights via a LoRA finetune and then merged into the base, so inference uses a single standalone model with no PEFT runtime required.
| Phase | Status | What it adds |
|-------|--------|--------------|
| **Base fork** ([LetheanNetwork/lemma](https://huggingface.co/LetheanNetwork/lemma)) | ✅ Released | EUPL-1.2 fork of Gemma 4 E4B — unmodified Google weights |
| **LEK merged** (this repo) | ✅ Released | Lethean Ethical Kernel — axiom-based reasoning via LoRA merge |
| **8-PAC eval results** | 🚧 In progress | Continuous benchmarking on the homelab, published to [lthn/LEM-benchmarks](https://huggingface.co/datasets/lthn/LEM-benchmarks) |
The LEK axioms are public domain and published at [Snider/ai-ethics](https://github.com/Snider/ai-ethics). Track research progress at [LetheanNetwork](https://github.com/LetheanNetwork) and the [LEM-research dataset](https://huggingface.co/datasets/lthn/LEM-research).
## Why EUPL-1.2
Lemma is licensed under the [European Union Public Licence v1.2](https://joinup.ec.europa.eu/collection/eupl/eupl-text-eupl-12) — not Apache 2.0 or MIT. This is a deliberate choice:
- **23 official languages, one legal meaning.** EUPL is the only OSS licence designed by lawmakers across multiple legal systems. "Derivative work" means the same thing in German, French, Estonian, and Maltese law.
- **Copyleft with compatibility.** Modifications must be shared back, but the licence plays cleanly with GPL, LGPL, MPL, and other major OSS licences. No accidental relicensing.
- **No proprietary capture.** Anyone can use lemma commercially — but they cannot fork it, train a competitor model on it, and close-source the result. The ethical layer stays in the open.
- **Built for institutions.** Government, research, and enterprise users get a licence designed for cross-border compliance, not a US-centric one.
## Recommended Sampling
Use Google's standardised settings across all use cases:
| Parameter | Value |
|-----------|-------|
| `temperature` | 1.0 |
| `top_p` | 0.95 |
| `top_k` | 64 |
| `stop` | The model's special stop tokens, as defined in `tokenizer_config.json` |
> Gemma 4 is calibrated for `temperature: 1.0` — this is **not** the same as the typical 0.7 default for other models. Lower values reduce diversity without improving quality. These defaults are pre-configured in the `params` file (Ollama) and `generation_config.json` (transformers/MLX).
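
When calling the model from Python, the table values can be kept in one place and passed explicitly. The sketch below does this for llama-cpp-python, whose `create_chat_completion` accepts `temperature`, `top_p`, and `top_k`; the `sampling_kwargs` and `chat` helpers are illustrative names, and `llm` is assumed to be a `Llama` instance created as in the Quick Start.

```python
# Values copied from the Recommended Sampling table above.
RECOMMENDED_SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 64}


def sampling_kwargs(**overrides):
    """Return the recommended settings, with optional per-call overrides."""
    unknown = set(overrides) - set(RECOMMENDED_SAMPLING)
    if unknown:
        raise ValueError(f"unsupported sampling keys: {sorted(unknown)}")
    return {**RECOMMENDED_SAMPLING, **overrides}


def chat(llm, prompt, **overrides):
    """Call create_chat_completion on `llm` with the recommended sampling.

    `llm` is expected to be a llama_cpp.Llama instance.
    """
    return llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        **sampling_kwargs(**overrides),
    )
```

Rejecting unknown keys keeps a typo such as `topk=40` from silently falling back to the defaults.
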
## Variable Image Resolution
Gemma 4 supports a configurable visual token budget that controls how many tokens represent each image. Higher budgets preserve more detail; lower budgets speed up inference.
| Token Budget | Use Case |
|--------------|----------|
| 70 | Classification, captioning, video frame processing |
| 140 | General image understanding |
| **280** | Default — balanced quality and speed |
| 560 | OCR, document parsing, fine-grained detail |
| 1120 | Maximum detail (small text, complex documents) |
For multimodal prompts, place image and audio content **before** text for best results.
The default budget (`280`) is set in `processor_config.json` via `image_seq_length` and `max_soft_tokens`. Override it by adjusting those fields, or per call by passing an explicit `image_seq_length` to the processor where supported.
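
As an illustration of the config-file route, the sketch below rewrites both fields in a local copy of `processor_config.json`. It assumes the two fields are top-level keys in that file, which may not match your copy; the `set_visual_token_budget` helper is hypothetical.

```python
import json
from pathlib import Path

# Budgets listed in the table above.
SUPPORTED_BUDGETS = (70, 140, 280, 560, 1120)


def set_visual_token_budget(config_path, budget):
    """Write `budget` into a local processor_config.json and return the config.

    Assumption: image_seq_length and max_soft_tokens are top-level keys.
    If your copy nests them, adapt the lookup accordingly.
    """
    if budget not in SUPPORTED_BUDGETS:
        raise ValueError(f"budget must be one of {SUPPORTED_BUDGETS}")
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["image_seq_length"] = budget
    config["max_soft_tokens"] = budget
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For example, `set_visual_token_budget("processor_config.json", 560)` would select the OCR-oriented budget; all other keys in the file are left untouched.
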
## Audio (E4B)
E4B supports speech recognition (ASR) and speech translation (AST) on clips of up to 30 seconds via mlx-vlm. Split longer audio into chunks of 30 seconds or less before inference. To run the Python example below, install mlx-vlm into your environment with `uv pip install mlx-vlm` (see also the mlx-vlm quick start above).
```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
model, processor = load("lthn/lemma")
config = load_config("lthn/lemma")
# Audio file — wav, mp3 native; m4a, aac, ogg, opus via ffmpeg
audio = ["path/to/speech.wav"]
prompt = """Transcribe the following speech segment in English into English text.
Follow these specific instructions for formatting the answer:
* Only output the transcription, with no newlines.
* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."""
formatted_prompt = apply_chat_template(processor, config, prompt, num_audios=1)
output = generate(model, processor, formatted_prompt, audio=audio)
print(output.text)
```
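
For recordings longer than the 30-second limit, one simple approach is to cut the file at fixed offsets before transcription. The sketch below does this for WAV input using only Python's standard `wave` module; the `split_wav` helper and its output naming are illustrative, and other formats would need converting to WAV first.

```python
import wave
from pathlib import Path

MAX_CLIP_SECONDS = 30  # per-clip limit stated above


def split_wav(path, out_dir, max_seconds=MAX_CLIP_SECONDS):
    """Split a WAV file into consecutive chunks of at most max_seconds.

    Returns the chunk paths in order. Cuts are made at fixed offsets,
    so a word may be split across two chunks.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    with wave.open(str(path), "rb") as src:
        params = src.getparams()
        frames_per_chunk = src.getframerate() * max_seconds
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            chunk_path = out_dir / f"{Path(path).stem}_{index:03d}.wav"
            with wave.open(str(chunk_path), "wb") as dst:
                dst.setparams(params)  # frame count is corrected on close
                dst.writeframes(frames)
            chunk_paths.append(str(chunk_path))
            index += 1
    return chunk_paths
```

Each returned path can then be passed as `audio=[chunk]` to `generate`, as in the example above, and the transcripts joined in order.
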
## Benchmarks
Live evaluation results are published to the [LEM-benchmarks dataset](https://huggingface.co/datasets/lthn/LEM-benchmarks). The lemma-specific results live at [LEM-benchmarks/results/lemma](https://huggingface.co/datasets/lthn/LEM-benchmarks/tree/main/results/lemma).
The 8-PAC eval pipeline runs continuously on our homelab and publishes results as they complete. Categories: ethics, reasoning, instruction-following, coding, multilingual, safety, knowledge, creativity.
## Resources
| Resource | Link |
|----------|------|
| **Benchmark results** | [lthn/LEM-benchmarks](https://huggingface.co/datasets/lthn/LEM-benchmarks) |
| **LiveBench results** | [lthn/livebench](https://huggingface.co/datasets/lthn/livebench) |
| **Research notes** | [lthn/LEM-research](https://huggingface.co/datasets/lthn/LEM-research) |
| **Lemma model collection** | [lthn/lemma](https://huggingface.co/collections/lthn/lemma) |
## About Lethean
[Lethean](https://lthn.ai) is a social enterprise building ethical AI infrastructure. The Lemma model family is part of the [LEM (Lethean Ethical Model)](https://github.com/LetheanNetwork) project — training protocol and tooling for intrinsic ethical alignment of language models.
- Website: [lthn.ai](https://lthn.ai)
- GitHub: [LetheanNetwork](https://github.com/LetheanNetwork)
- Licence: [EUPL-1.2](https://joinup.ec.europa.eu/collection/eupl/eupl-text-eupl-12)