# Gemma 4 E4B — Opus Reasoning + Claude Code | Tool Calling ✅ | OpenHarness ✅ | OpenClaw ✅ | Hermes Agent ✅ | Reasoning Baked In
Opus 4.6 reasoning + Claude Code fused into weights. Native tool calling. OpenHarness agent harness. OpenClaw orchestration. Hermes terminal-agent skill.
Reasoning baked in. No adapter needed. Built by RavenX AI.
Gemma 4 E4B with the Opus Reasoning + Claude Code LoRA fused directly into the weights — no adapter needed, no extra memory, just load and run with Claude-style `<think>` reasoning baked in.
~10.5 GB. 131K context. Text + vision. Drop-in reasoning upgrade.
This is gemma-4-E4B-mlx-4bit with the Opus Reasoning + Claude Code LoRA merged directly into the base weights using mlx weight arithmetic.
## What's different from the base model

| Capability | Base model | This model |
|---|---|---|
| `<think>` tag reasoning | ❌ | ✅ baked in |
| Claude-style structured answers | ❌ | ✅ |
| Tool-use patterns | ❌ | ✅ |
| Requires adapter | — | ❌ no adapter needed |
| File size | 4.86 GB (4-bit) | ~10.5 GB (bfloat16 merged) |
| Vision support | ✅ | ✅ |
## 🧪 Live Demos — Try It Now
| Space | What to try |
|---|---|
| 🔥 Agentic Tool Calling Demo | Live agentic loop — tool calling, `<think>` reasoning, calculator, web search |
| 🐳 OpenClaw Sandbox Demo | OpenClaw-style orchestration, Docker runtime, sandbox/approval modes |
## Quickstart

```bash
pip install mlx-lm mlx-vlm
```

```python
from mlx_lm import load, generate

# No adapter_path needed — reasoning is in the weights
model, tokenizer = load("deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit")

messages = [{"role": "user", "content": "Explain why RSA encryption is hard to break."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
response = generate(model, tokenizer, prompt=prompt, max_tokens=1024, verbose=True)
# → Produces <think>...</think> followed by a structured answer
```
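The `<think>...</think>` block can be split from the final answer with plain string handling. A minimal sketch (the `split_reasoning` helper is ours, not part of mlx-lm):

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Split generated text into (reasoning, answer), assuming the model
    emits an optional <think>...</think> block before the final answer."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not match:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()

# Example on a canned response:
sample = "<think>RSA relies on factoring.</think>Factoring large semiprimes is hard."
reasoning, answer = split_reasoning(sample)
print(reasoning)  # → RSA relies on factoring.
print(answer)     # → Factoring large semiprimes is hard.
```

This is handy when you want to log the reasoning trace but show users only the final answer.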
## CLI

```bash
mlx_lm.generate \
  --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit \
  --prompt "Debug this Python code: def fib(n): return fib(n-1) + fib(n-2)" \
  --max-tokens 1024
```
## 🧩 OpenHarness + OpenClaw + Hermes Agent
This model is built to sit inside a real agent stack, not just a chat box.
We support:
- OpenHarness for agent harness/runtime, skills, hooks, tool loops, and multi-agent flows
- OpenClaw for orchestration, sessions, reminders, and cross-agent routing
- Hermes agent skill for terminal-native coding posture, short planning, aggressive tool use, and repo-aware execution
### Why this combo matters
| Layer | Role |
|---|---|
| Gemma 4 E4B Opus Reasoning + Claude Code | reasoning + tool-use behavior baked into the weights |
| Gemini CLI | coding agent + tool orchestration |
| OpenHarness | harness runtime, tool loop, swarm, hooks, memory |
| OpenClaw | orchestration, sessions, skills, messaging |
| Hermes skill | agent behavior for concise, terminal-first execution |
### OpenHarness quickstart

```bash
pip install openharness

mlx_lm.server \
  --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit \
  --port 8080

oh --model http://localhost:8080/v1 \
  --skill hermes-agent \
  -p "Review this repo, find bugs, patch them, and summarize the result"
```
### OpenClaw skill stack

Inside OpenClaw, pair this model with:

- the `openharness` skill — run and configure `oh`
- the `hermes-agent` skill — shape coding-agent behavior
That gives you a fully local Apple Silicon agent lane with:
- baked-in reasoning
- native tool calling
- Gemini CLI integration
- OpenHarness runtime support
- OpenClaw orchestration
## 💻 Gemini CLI — Coding Agent + Tool Orchestration
We use RavenX AI's Gemini CLI fork as the coding agent and tool orchestration layer on top of these models. This is what makes the tool-calling capability real in production.
Gemini CLI gives you a full agentic loop in the terminal — Google Search grounding, file read/write, shell execution, web fetching, and MCP server support — all wired to a 1M token context window.
```bash
# Install
npm install -g @google/gemini-cli

# Run as a coding agent against this model (via local mlx_lm server)
mlx_lm.server --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --port 8080 &
gemini --baseUrl http://localhost:8080

# Or use directly against the Gemini API (free tier: 60 req/min)
gemini
```
### What Gemini CLI + these models unlock together
| Capability | How |
|---|---|
| Code generation | Gemini CLI reads your codebase, the model reasons with `<think>` tags |
| Tool calling | Native `<\|tool>` tokens → Gemini CLI executes shell/file/web tools |
| Long context | 1M ctx in CLI + TurboQuant 4.6x KV compression = very long sessions |
| MCP servers | Connect any MCP server — databases, APIs, custom tools |
| Search grounding | Google Search built in — model gets live data |
```bash
# Real example: code review with tool calling enabled
gemini --baseUrl http://localhost:8080 \
  "Review all Python files in ./src, find potential bugs, and suggest fixes"

# Gemini CLI will: read files → call tools → model reasons → produce structured output
```
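At its core the agentic loop is just parse, execute, feed back. Below is a minimal sketch with a stubbed model and a calculator tool; the `run_agent` loop, the `TOOLS` registry, and the `<|tool>{...}</|tool>` JSON shape are all illustrative, not the actual Gemini CLI protocol:

```python
import json
import re

# Illustrative tool registry; a real harness registers shell/file/web tools.
TOOLS = {"calculator": lambda expr: str(eval(expr, {"__builtins__": {}}))}

def run_agent(model, prompt: str, max_steps: int = 5) -> str:
    """Ask the model, execute any tool call it emits, append the result,
    and stop when the model answers without requesting a tool."""
    transcript = prompt
    for _ in range(max_steps):
        output = model(transcript)
        call = re.search(r"<\|tool>(\{.*?\})</\|tool>", output)  # hypothetical tag format
        if not call:
            return output  # plain answer: done
        req = json.loads(call.group(1))
        result = TOOLS[req["name"]](req["args"])
        transcript += f"\n[tool:{req['name']}] {result}"
    return "max steps exceeded"

# Stub model: requests the calculator once, then answers with its result.
def stub_model(transcript: str) -> str:
    if "[tool:calculator]" in transcript:
        return "The answer is " + transcript.rsplit(" ", 1)[-1]
    return '<|tool>{"name": "calculator", "args": "6*7"}</|tool>'

print(run_agent(stub_model, "What is 6*7?"))  # → The answer is 42
```

A real stack replaces `stub_model` with a call to the mlx_lm server and `TOOLS` with Gemini CLI's shell/file/web executors, but the control flow is the same.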
→ DeadByDawn101/gemini-cli on GitHub — Apache 2.0, free tier, MCP-compatible
## ⚡ TurboQuant-MLX — 4.6x KV Cache Compression
Pair with TurboQuant-MLX to compress the KV cache and run 4.6x longer reasoning chains at the same memory:
```python
from turboquant_mlx.mlx_kvcache import TurboQuantKVCache
import mlx_lm.models.cache as cache_module

# Patch prompt-cache creation so every layer gets a compressed KV cache
cache_module.make_prompt_cache = lambda model, **kw: [
    TurboQuantKVCache() for _ in range(len(model.layers))
]

from mlx_lm import load, generate

model, tokenizer = load("deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit")
# Long reasoning chains now fit in the same RAM budget
```
→ TurboQuant-MLX on GitHub · v2.0 Release
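As a back-of-envelope check on why cache compression buys longer chains: KV memory grows linearly with context length, so a 4.6x smaller per-token footprint means roughly 4.6x more context in the same RAM. A sketch with illustrative architecture numbers (not Gemma's actual config):

```python
def kv_cache_bytes(context_len: int, n_layers: int = 30, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size: two tensors (K and V) per layer.
    The layer/head/dim numbers here are illustrative, not Gemma's."""
    return 2 * n_layers * context_len * n_kv_heads * head_dim * bytes_per_elem

full = kv_cache_bytes(131_072)      # bf16 cache at full 131K context
compressed = full / 4.6             # TurboQuant's claimed ratio
print(f"{full / 2**30:.1f} GiB -> {compressed / 2**30:.1f} GiB")  # → 15.0 GiB -> 3.3 GiB
```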
## How it was made

### Training data
| Source | Examples |
|---|---|
| Crownelius/Opus-4.6-Reasoning-2100x-formatted | 2,054 |
| Claude Code tool-use patterns | 140 files |
| Total | 2,163 |
### Training

- Base: deadbydawn101/gemma-4-E4B-mlx-4bit
- Method: SFT, completions-only (`mlx_vlm.lora`)
- Rank: 8 · Alpha: 16 · LR: 1e-5 · Iters: 1,000
- Hardware: Apple M4 Max 128GB · Peak mem: 7.876 GB
- Final loss: ~3.5e-7
### Fusion

All 378 LoRA pairs merged via weight arithmetic:

```
W_merged = dequantize(W_base) + (A @ B).T × (alpha / rank)
```

The result is dequantized to bfloat16 and saved as 3-shard safetensors.
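The merge formula above, sketched in NumPy. Shapes follow the LoRA convention assumed here (`A: (in, rank)`, `B: (rank, out)`, base weight `(out, in)`), and `dequantize` is elided since the toy base weight is already float:

```python
import numpy as np

def merge_lora(w_base, A, B, alpha=16.0, rank=8):
    """W_merged = W_base + (A @ B).T * (alpha / rank)."""
    return w_base + (A @ B).T * (alpha / rank)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 6))   # toy (out, in) base weight
A = rng.standard_normal((6, 8))   # lora_a: (in, rank)
B = rng.standard_normal((8, 4))   # lora_b: (rank, out)

merged = merge_lora(w, A, B)
assert merged.shape == w.shape
# Zero adapters leave the base weights untouched:
assert np.allclose(merge_lora(w, np.zeros((6, 8)), np.zeros((8, 4))), w)
```

The fused model ships these merged matrices, which is why no `adapter_path` is needed at load time.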
## 🦙 Ollama — One-Command Setup

### Instant run (no install needed)

```bash
ollama run hf.co/deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit
```
### With a custom system prompt + tool support

Create a `Modelfile`:

```
FROM hf.co/deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit
SYSTEM "You are a helpful assistant with tool-use capabilities. Think through problems step by step."
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
```

```bash
ollama create ravenx-gemma4 -f Modelfile
ollama run ravenx-gemma4
```
### OpenAI-compatible endpoint

```bash
# Ollama exposes an OpenAI-compatible API automatically
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "hf.co/deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
### Run with mlx_lm server (native, faster on Apple Silicon)

```bash
# mlx_lm server uses the Metal GPU directly and is faster than Ollama on Apple Silicon
mlx_lm.server --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --port 8080

# Then use any OpenAI client
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit", "messages": [{"role": "user", "content": "Hello!"}]}'
```
## Related models
| Model | Size | Notes |
|---|---|---|
| gemma-4-E4B-mlx-4bit | 4.86 GB | Base model (4-bit, use with adapter) |
| gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit | ~10.5 GB | This model — fused, no adapter needed |
| gemma-4-E4B-opus-reasoning-claude-code-lora | 658 MB | Adapter-only |
| gemma-4-E2B-Heretic-Uncensored-mlx-4bit | 3.34 GB | 2B abliterated |
| gemma-4-21b-REAP-Tool-Calling-mlx-4bit | 12 GB | 21B MoE REAP |
## License
## TriAttention KV Compression
[2026-04-09] Our MLX port was merged into TriAttention (MIT + NVIDIA) — PR #1 by @DeadByDawn101 (RavenX AI).
Apply a 10.7x KV memory reduction and a 2.5x throughput gain on top of this model's built-in 4-bit TurboQuant quantization, for ~50x combined compression vs. full fp16:
```python
from mlx_lm import load
from triattention.mlx import apply_triattention_mlx

model, tokenizer = load("deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit")
apply_triattention_mlx(model, kv_budget=2048)
```
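One reading of the "~50x" figure: multiply TurboQuant's 4.6x cache compression (quoted above) by TriAttention's 10.7x reduction:

```python
turboquant = 4.6     # TurboQuant KV cache compression ratio
triattention = 10.7  # TriAttention KV memory reduction
combined = turboquant * triattention
print(f"{combined:.1f}x")  # → 49.2x, i.e. roughly 50x vs an uncompressed fp16 cache
```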
## RavenX Inference Harness
One-command inference, benchmarking, and local OpenAI-compatible server:
```bash
git clone https://github.com/DeadByDawn101/ravenx-inference-harness
cd ravenx-inference-harness

# Inference
python run.py --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --prompt "Your prompt"

# TriAttention compressed
python run.py --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --triattention --kv-budget 2048

# Local OpenAI-compatible server (works with OpenClaw)
python serve.py --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --triattention
```
## Model tree

Base model: google/gemma-4-E4B-it