---
license: eupl-1.2
pipeline_tag: image-text-to-text
library_name: gguf
base_model:
- LetheanNetwork/lemma
base_model_relation: finetune
tags:
- gemma4
- lemma
- gguf
- llama.cpp
- ollama
- multimodal
- vision
- audio
- on-device
- conversational
---

# Lemma — Gemma 4 E4B (GGUF)

The mid-sized member of the [Lemma model family](https://huggingface.co/collections/lthn/lemma) by [Lethean](https://lthn.ai).

An EUPL-1.2 fork of [Gemma 4 E4B](https://huggingface.co/google/gemma-4-E4B-it) with the **Lethean Ethical Kernel (LEK) merged into the weights** — consent-based reasoning baked into the attention projections via LoRA finetune, then merged so inference uses a single standalone model with no PEFT runtime required.

This repo ships the **GGUF multi-quant build** — five quants from Q4_K_M up to BF16, with full multimodal support (text, image, audio). Use with Ollama, llama.cpp, GPT4All, or LM Studio. The unmodified Gemma 4 E4B fork lives at [LetheanNetwork/lemma](https://huggingface.co/LetheanNetwork/lemma) for users who want the raw Google weights without the LEK shift.

**Looking for MLX?** The native Apple Silicon builds live in sibling repos: [`lthn/lemma-mlx`](https://huggingface.co/lthn/lemma-mlx) (4-bit default) | [`lthn/lemma-mlx-8bit`](https://huggingface.co/lthn/lemma-mlx-8bit) | [`lthn/lemma-mlx-bf16`](https://huggingface.co/lthn/lemma-mlx-bf16) (full precision)

> A **lemma** is "something assumed" — an intermediate theorem on the path to a larger proof, or a heading that signals the subject of what follows. The Lemma model family is named for that role: each variant is a stepping stone between raw capability and ethical application.
## GGUF Variants

| File | Quant | Size | Use Case |
|------|-------|------|----------|
| `lemma-q4_k_m.gguf` | Q4_K_M | 5.0 GB | **Recommended** — best size/quality balance |
| `lemma-q5_k_m.gguf` | Q5_K_M | 5.4 GB | Higher quality, moderate size |
| `lemma-q6_k.gguf` | Q6_K | 5.8 GB | Near-lossless |
| `lemma-q8_0.gguf` | Q8_0 | 7.5 GB | Maximum quality quantised |
| `lemma-bf16.gguf` | BF16 | 14 GB | Full precision reference |

All variants verified locally on Apple Silicon via Ollama, llama-cpp-python, mlx-lm, and mlx-vlm.

### Repo Files

| File | Format | Purpose |
|------|--------|---------|
| `lemma-*.gguf` | GGUF | Ollama, llama.cpp, GPT4All, LM Studio |
| `model-*-of-00002.safetensors` | MLX safetensors (sharded) | Native Apple Silicon via `mlx-lm` and `mlx-vlm` (Q4 multimodal) |
| `model.safetensors.index.json` | JSON | Tensor index for the sharded safetensors weights |
| `config.json` | JSON | Multimodal model config (architecture, quantisation, vision/audio towers) |
| `tokenizer.json` | JSON | Tokenizer vocabulary (262K tokens) |
| `tokenizer_config.json` | JSON | Tokenizer settings and special tokens |
| `chat_template.jinja` | Jinja2 | Chat template for transformers, mlx-lm, mlx-vlm |
| `processor_config.json` | JSON | Image and audio processor config (mlx-vlm) |
| `generation_config.json` | JSON | Default generation parameters (temperature, top_p, top_k) |
| `LICENSE` | Text | EUPL-1.2 licence text |
| `README.md` | Markdown | This file — model card |

## Quick Start

### Apps & CLI
**Ollama**

```bash
ollama run hf.co/lthn/lemma:Q4_K_M
```
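Once pulled, the model is also reachable through Ollama's local REST API (port 11434, `/api/chat`). A minimal stdlib-only sketch, assuming a local `ollama serve` is running with the model above; the helper names are illustrative:

```python
import json
import urllib.request

def build_chat_payload(prompt, model="hf.co/lthn/lemma:Q4_K_M"):
    # Non-streaming request body for Ollama's /api/chat endpoint.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def chat(prompt, host="http://localhost:11434"):
    req = urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # Ollama returns the reply under message.content.
        return json.load(resp)["message"]["content"]

# chat("Hello, how are you?")  # requires `ollama serve` running locally
```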
**Docker**

```bash
docker model run hf.co/lthn/lemma
```

Or from Docker Hub:

```bash
docker model run lthn/lemma
```
**Unsloth Studio**

```bash
# macOS / Linux / WSL
curl -fsSL https://unsloth.ai/install.sh | sh

# Windows
irm https://unsloth.ai/install.ps1 | iex
```

```bash
unsloth studio -H 0.0.0.0 -p 8888
# Open http://localhost:8888 — search for lthn/lemma
```

Or use [HuggingFace Spaces](https://huggingface.co/spaces/unsloth/studio) — no install, search for `lthn/lemma`.
**llama.cpp**

Install via brew (macOS/Linux), winget (Windows), or build from source:

```bash
brew install llama.cpp    # macOS/Linux
winget install llama.cpp  # Windows
```

```bash
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf lthn/lemma:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf lthn/lemma:Q4_K_M
```

Or build from source:

```bash
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
./build/bin/llama-server -hf lthn/lemma:Q4_K_M
./build/bin/llama-cli -hf lthn/lemma:Q4_K_M
```
**MLX (Apple Silicon native)**

```bash
uv tool install mlx-lm
mlx_lm.chat --model lthn/lemma
mlx_lm.generate --model lthn/lemma --prompt "Hello, how are you?"
```
### Python Libraries
**llama-cpp-python**

```bash
uv pip install llama-cpp-python
```

```python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="lthn/lemma",
    filename="lemma-q4_k_m.gguf",
)

# Text
llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello, how are you?"}]
)

# Vision (multimodal)
llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                    },
                },
            ],
        }
    ]
)
```
**mlx-vlm (vision + audio)**

```bash
uv tool install mlx-vlm
```

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model, processor = load("lthn/lemma")
config = load_config("lthn/lemma")

image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image."

formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=1
)

output = generate(model, processor, formatted_prompt, image)
print(output.text)
```
### Servers (OpenAI-compatible API)
**MLX Server**

`lemma` is multimodal, so use `mlx_vlm.server` — the vision-aware variant that handles image and audio inputs. The text-only `mlx_lm.server` does not correctly route multimodal tensors for Gemma 4.

```bash
mlx_vlm.server --model lthn/lemma
```

```bash
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "lthn/lemma",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "max_tokens": 200
  }'
```

Works with any OpenAI-compatible client at `http://localhost:8080/v1`.
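For instance, the same request can be issued from Python with nothing but the standard library. A sketch, assuming the server above is running on its default port; the helper names are illustrative, and the payload mirrors the curl call:

```python
import json
import urllib.request

def build_request(prompt, model="lthn/lemma", max_tokens=200):
    # OpenAI-style chat completion body, same shape as the curl example.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def complete(prompt, base_url="http://localhost:8080/v1"):
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # OpenAI-compatible servers return choices[0].message.content.
        return json.load(resp)["choices"][0]["message"]["content"]

# complete("Hello, how are you?")  # requires mlx_vlm.server running locally
```

Swapping `base_url` is all it takes to point the same client at llama-server or any other OpenAI-compatible endpoint.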
**vLLM**

> vLLM requires the original (non-quantised) safetensors weights from [LetheanNetwork/lemma](https://huggingface.co/LetheanNetwork/lemma) — it does not load GGUF or MLX-quantised safetensors. Linux + NVIDIA GPU.

```bash
uv pip install vllm
vllm serve "LetheanNetwork/lemma"
```
## Model Details

| Property | Value |
|----------|-------|
| **Architecture** | Gemma 4 E4B |
| **Total Parameters** | 7.9B total, 4.5B effective (Per-Layer Embeddings) |
| **Layers** | 34 |
| **Context Length** | 128K tokens |
| **Vocabulary** | 262K tokens |
| **Modalities** | Text, Image, Audio |
| **Sliding Window** | 512 tokens |
| **Vision Encoder** | ~150M params |
| **Audio Encoder** | ~300M params |
| **Base Model** | [LetheanNetwork/lemma](https://huggingface.co/LetheanNetwork/lemma) |
| **Licence** | EUPL-1.2 |

## The Lemma Family

| Name | Source (BF16 weights) | Params | Context | Modalities | Consumer Repo |
|------|----------------------|--------|---------|------------|---------------|
| **Lemer** | [LetheanNetwork/lemer](https://huggingface.co/LetheanNetwork/lemer) | 2.3B eff | 128K | Text, Image, Audio | [lthn/lemer](https://huggingface.co/lthn/lemer) |
| **Lemma** | [LetheanNetwork/lemma](https://huggingface.co/LetheanNetwork/lemma) | 4.5B eff | 128K | Text, Image, Audio | You are here |
| **Lemmy** | [LetheanNetwork/lemmy](https://huggingface.co/LetheanNetwork/lemmy) | 3.8B active | 256K | Text, Image | [lthn/lemmy](https://huggingface.co/lthn/lemmy) |
| **Lemrd** | [LetheanNetwork/lemrd](https://huggingface.co/LetheanNetwork/lemrd) | 30.7B | 256K | Text, Image | [lthn/lemrd](https://huggingface.co/lthn/lemrd) |

## Capabilities

- Configurable thinking mode (`<|think|>` token in system prompt enables it; off by default in our examples via `enable_thinking=False`)
- Native function calling and system prompt support
- Variable aspect ratio image understanding
- Audio speech recognition and translation (ASR/AST)
- Multilingual support (140+ languages)
- Hybrid attention (sliding window + global)

## Roadmap

This release of `lemma` is **Gemma 4 E4B with the Lethean Ethical Kernel (LEK) merged in** — axiom-based reasoning baked into the attention weights via LoRA finetune, then merged into the base so inference uses a single standalone model with no PEFT
runtime required. The unmodified Gemma 4 E4B fork lives at [LetheanNetwork/lemma](https://huggingface.co/LetheanNetwork/lemma) for users who want the raw Google weights without the LEK shift.

| Phase | Status | What it adds |
|-------|--------|--------------|
| **Base fork** ([LetheanNetwork/lemma](https://huggingface.co/LetheanNetwork/lemma)) | ✅ Released | EUPL-1.2 fork of Gemma 4 E4B — unmodified Google weights |
| **LEK merged** (this repo) | ✅ Released | Lethean Ethical Kernel — axiom-based reasoning via LoRA merge |
| **8-PAC eval results** | 🚧 In progress | Continuous benchmarking on the homelab, published to [lthn/LEM-benchmarks](https://huggingface.co/datasets/lthn/LEM-benchmarks) |

The LEK axioms are public domain and published at [Snider/ai-ethics](https://github.com/Snider/ai-ethics). Track research progress at [LetheanNetwork](https://github.com/LetheanNetwork) and the [LEM-research dataset](https://huggingface.co/datasets/lthn/LEM-research).

## Why EUPL-1.2

Lemma is licensed under the [European Union Public Licence v1.2](https://joinup.ec.europa.eu/collection/eupl/eupl-text-eupl-12) — not Apache 2.0 or MIT. This is a deliberate choice:

- **23 official languages, one legal meaning.** EUPL is the only OSS licence designed by lawmakers across multiple legal systems. "Derivative work" means the same thing in German, French, Estonian, and Maltese law.
- **Copyleft with compatibility.** Modifications must be shared back, but the licence plays cleanly with GPL, LGPL, MPL, and other major OSS licences. No accidental relicensing.
- **No proprietary capture.** Anyone can use lemma commercially — but they cannot fork it, train a competitor model on it, and close-source the result. The ethical layer stays in the open.
- **Built for institutions.** Government, research, and enterprise users get a licence designed for cross-border compliance, not a US-centric one.
## Recommended Sampling

Use Google's standardised settings across all use cases:

| Parameter | Value |
|-----------|-------|
| `temperature` | 1.0 |
| `top_p` | 0.95 |
| `top_k` | 64 |
| `stop` | `<end_of_turn>`, `<eos>` |

> Gemma 4 is calibrated for `temperature: 1.0` — this is **not** the same as the typical 0.7 default for other models. Lower values reduce diversity without improving quality.

These defaults are pre-configured in the `params` file (Ollama) and `generation_config.json` (transformers/MLX).

## Variable Image Resolution

Gemma 4 supports a configurable visual token budget that controls how many tokens represent each image. Higher = more detail, lower = faster inference.

| Token Budget | Use Case |
|--------------|----------|
| 70 | Classification, captioning, video frame processing |
| 140 | General image understanding |
| **280** | Default — balanced quality and speed |
| 560 | OCR, document parsing, fine-grained detail |
| 1120 | Maximum detail (small text, complex documents) |

For multimodal prompts, place image and audio content **before** text for best results. The default budget (`280`) is set in `processor_config.json` via `image_seq_length` and `max_soft_tokens`. Override per call by adjusting those fields, or by passing an explicit `image_seq_length` to the processor where supported.

## Audio (E4B)

E4B supports speech recognition (ASR) and speech translation (AST) up to 30 seconds per clip via mlx-vlm. Audio longer than 30 seconds should be split into chunks before inference. Install mlx-vlm with `uv tool install mlx-vlm` (or see the MLX quick start above).

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model, processor = load("lthn/lemma")
config = load_config("lthn/lemma")

# Audio file — wav, mp3 native; m4a, aac, ogg, opus via ffmpeg
audio = ["path/to/speech.wav"]

prompt = """Transcribe the following speech segment in English into English text.
Follow these specific instructions for formatting the answer:
* Only output the transcription, with no newlines.
* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."""

formatted_prompt = apply_chat_template(processor, config, prompt, num_audios=1)
output = generate(model, processor, formatted_prompt, audio=audio)
print(output.text)
```

## Benchmarks

Live evaluation results are published to the [LEM-benchmarks dataset](https://huggingface.co/datasets/lthn/LEM-benchmarks). The lemma-specific results live at [LEM-benchmarks/results/lemma](https://huggingface.co/datasets/lthn/LEM-benchmarks/tree/main/results/lemma).

The 8-PAC eval pipeline runs continuously on our homelab and publishes results as they complete. Categories: ethics, reasoning, instruction-following, coding, multilingual, safety, knowledge, creativity.

## Resources

| Resource | Link |
|----------|------|
| **Benchmark results** | [lthn/LEM-benchmarks](https://huggingface.co/datasets/lthn/LEM-benchmarks) |
| **LiveBench results** | [lthn/livebench](https://huggingface.co/datasets/lthn/livebench) |
| **Research notes** | [lthn/LEM-research](https://huggingface.co/datasets/lthn/LEM-research) |
| **Lemma model collection** | [lthn/lemma](https://huggingface.co/collections/lthn/lemma) |

## About Lethean

[Lethean](https://lthn.ai) is a social enterprise building ethical AI infrastructure. The Lemma model family is part of the [LEM (Lethean Ethical Model)](https://github.com/LetheanNetwork) project — training protocol and tooling for intrinsic ethical alignment of language models.

- Website: [lthn.ai](https://lthn.ai)
- GitHub: [LetheanNetwork](https://github.com/LetheanNetwork)
- Licence: [EUPL-1.2](https://joinup.ec.europa.eu/collection/eupl/eupl-text-eupl-12)