# Gemma 4 E4B — Opus Reasoning + Claude Code | Tool Calling ✅ | OpenHarness ✅ | OpenClaw ✅ | Hermes Agent ✅ | Reasoning Baked In
Opus 4.6 reasoning + Claude Code fused into weights. Native tool calling. OpenHarness agent harness. OpenClaw orchestration. Hermes terminal-agent skill.
Reasoning baked in. No adapter needed. Built by RavenX AI.
Gemma 4 E4B with the Opus Reasoning + Claude Code LoRA fused directly into the weights — no adapter needed, no extra memory, just load and run with Claude-style `<think>` reasoning baked in.
~10.5 GB. 131K context. Text + vision. Drop-in reasoning upgrade.
This is gemma-4-E4B-mlx-4bit with the Opus Reasoning + Claude Code LoRA merged directly into the base weights using mlx weight arithmetic.
## What's different from the base model

| Capability | Base model | This model |
|---|---|---|
| `<think>` tag reasoning | ❌ | ✅ baked in |
| Claude-style structured answers | ❌ | ✅ |
| Tool-use patterns | ❌ | ✅ |
| Requires adapter | — | ❌ no adapter needed |
| File size | 4.86 GB (4-bit) | ~10.5 GB (bfloat16 merged) |
| Vision support | ✅ | ✅ |
## 🧪 Live Demos — Try It Now
| Space | What to try |
|---|---|
| 🔥 Agentic Tool Calling Demo | Live agentic loop — tool calling, `<think>` reasoning, calculator, web search |
| 🐳 OpenClaw Sandbox Demo | OpenClaw-style orchestration, Docker runtime, sandbox/approval modes |
## Quickstart

```bash
pip install mlx-lm mlx-vlm
```

```python
from mlx_lm import load, generate

# No adapter_path needed — reasoning is in the weights
model, tokenizer = load("deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit")

messages = [{"role": "user", "content": "Explain why RSA encryption is hard to break."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
response = generate(model, tokenizer, prompt=prompt, max_tokens=1024, verbose=True)
# → Produces <think>...</think> followed by a structured answer
```
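The `<think>...</think>` block can be split from the final answer with plain string handling. A minimal sketch (the `split_reasoning` helper is ours, not part of mlx-lm):

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Split generated text into (reasoning, answer), assuming the model
    emits an optional <think>...</think> block before the final answer."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not match:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()

# Example on a canned response:
sample = "<think>RSA relies on factoring.</think>Factoring large semiprimes is hard."
reasoning, answer = split_reasoning(sample)
print(reasoning)  # → RSA relies on factoring.
print(answer)     # → Factoring large semiprimes is hard.
```

This is handy when you want to log the reasoning trace but show users only the final answer.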
## CLI

```bash
mlx_lm.generate \
  --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit \
  --prompt "Debug this Python code: def fib(n): return fib(n-1) + fib(n-2)" \
  --max-tokens 1024
```
## 🧩 OpenHarness + OpenClaw + Hermes Agent
This model is built to sit inside a real agent stack, not just a chat box.
We support:
- OpenHarness for agent harness/runtime, skills, hooks, tool loops, and multi-agent flows
- OpenClaw for orchestration, sessions, reminders, and cross-agent routing
- Hermes agent skill for terminal-native coding posture, short planning, aggressive tool use, and repo-aware execution
### Why this combo matters
| Layer | Role |
|---|---|
| Gemma 4 E4B Opus Reasoning + Claude Code | reasoning + tool-use behavior baked into the weights |
| Gemini CLI | coding agent + tool orchestration |
| OpenHarness | harness runtime, tool loop, swarm, hooks, memory |
| OpenClaw | orchestration, sessions, skills, messaging |
| Hermes skill | agent behavior for concise, terminal-first execution |
### OpenHarness quickstart

```bash
pip install openharness

mlx_lm.server \
  --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit \
  --port 8080

oh --model http://localhost:8080/v1 \
  --skill hermes-agent \
  -p "Review this repo, find bugs, patch them, and summarize the result"
```
### OpenClaw skill stack

Inside OpenClaw, pair this model with:

- the `openharness` skill — run and configure `oh`
- the `hermes-agent` skill — shape coding-agent behavior
That gives you a fully local Apple Silicon agent lane with:
- baked-in reasoning
- native tool calling
- Gemini CLI integration
- OpenHarness runtime support
- OpenClaw orchestration
## 💻 Gemini CLI — Coding Agent + Tool Orchestration
We use RavenX AI's Gemini CLI fork as the coding agent and tool orchestration layer on top of these models. This is what makes the tool-calling capability real in production.
Gemini CLI gives you a full agentic loop in the terminal — Google Search grounding, file read/write, shell execution, web fetching, and MCP server support — all wired to a 1M token context window.
```bash
# Install
npm install -g @google/gemini-cli

# Run as a coding agent against this model (via local mlx_lm server)
mlx_lm.server --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --port 8080 &
gemini --baseUrl http://localhost:8080

# Or use directly against the Gemini API (free tier: 60 req/min)
gemini
```
### What Gemini CLI + these models unlock together
| Capability | How |
|---|---|
| Code generation | Gemini CLI reads your codebase, the model reasons with `<think>` tags |
| Tool calling | Native `<\|tool>` tokens → Gemini CLI executes shell/file/web tools |
| Long context | 1M ctx in CLI + TurboQuant 4.6x KV compression = very long sessions |
| MCP servers | Connect any MCP server — databases, APIs, custom tools |
| Search grounding | Google Search built in — model gets live data |
```bash
# Real example: code review with tool calling enabled
gemini --baseUrl http://localhost:8080 \
  "Review all Python files in ./src, find potential bugs, and suggest fixes"

# Gemini CLI will: read files → call tools → model reasons → produce structured output
```
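At its core the agentic loop is just parse, execute, feed back. Below is a minimal sketch with a stubbed model and a calculator tool; the `run_agent` loop, the `TOOLS` registry, and the `<|tool>{...}</|tool>` JSON shape are all illustrative, not the actual Gemini CLI protocol:

```python
import json
import re

# Illustrative tool registry; a real harness registers shell/file/web tools.
TOOLS = {"calculator": lambda expr: str(eval(expr, {"__builtins__": {}}))}

def run_agent(model, prompt: str, max_steps: int = 5) -> str:
    """Ask the model, execute any tool call it emits, append the result,
    and stop when the model answers without requesting a tool."""
    transcript = prompt
    for _ in range(max_steps):
        output = model(transcript)
        call = re.search(r"<\|tool>(\{.*?\})</\|tool>", output)  # hypothetical tag format
        if not call:
            return output  # plain answer: done
        req = json.loads(call.group(1))
        result = TOOLS[req["name"]](req["args"])
        transcript += f"\n[tool:{req['name']}] {result}"
    return "max steps exceeded"

# Stub model: requests the calculator once, then answers with its result.
def stub_model(transcript: str) -> str:
    if "[tool:calculator]" in transcript:
        return "The answer is " + transcript.rsplit(" ", 1)[-1]
    return '<|tool>{"name": "calculator", "args": "6*7"}</|tool>'

print(run_agent(stub_model, "What is 6*7?"))  # → The answer is 42
```

A real stack replaces `stub_model` with a call to the mlx_lm server and `TOOLS` with Gemini CLI's shell/file/web executors, but the control flow is the same.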
→ DeadByDawn101/gemini-cli on GitHub — Apache 2.0, free tier, MCP-compatible
## ⚡ TurboQuant-MLX — 4.6x KV Cache Compression
Pair with TurboQuant-MLX to compress the KV cache and run 4.6x longer reasoning chains at the same memory:
```python
from turboquant_mlx.mlx_kvcache import TurboQuantKVCache
import mlx_lm.models.cache as cache_module

# Patch prompt-cache creation so every layer gets a compressed KV cache
cache_module.make_prompt_cache = lambda model, **kw: [
    TurboQuantKVCache() for _ in range(len(model.layers))
]

from mlx_lm import load, generate

model, tokenizer = load("deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit")
# Long reasoning chains now fit in the same RAM budget
```
→ TurboQuant-MLX on GitHub · v2.0 Release
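As a back-of-envelope check on why cache compression buys longer chains: KV memory grows linearly with context length, so a 4.6x smaller per-token footprint means roughly 4.6x more context in the same RAM. A sketch with illustrative architecture numbers (not Gemma's actual config):

```python
def kv_cache_bytes(context_len: int, n_layers: int = 30, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size: two tensors (K and V) per layer.
    The layer/head/dim numbers here are illustrative, not Gemma's."""
    return 2 * n_layers * context_len * n_kv_heads * head_dim * bytes_per_elem

full = kv_cache_bytes(131_072)      # bf16 cache at full 131K context
compressed = full / 4.6             # TurboQuant's claimed ratio
print(f"{full / 2**30:.1f} GiB -> {compressed / 2**30:.1f} GiB")  # → 15.0 GiB -> 3.3 GiB
```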
## How it was made

### Training data
| Source | Examples |
|---|---|
| Crownelius/Opus-4.6-Reasoning-2100x-formatted | 2,054 |
| Claude Code tool-use patterns | 140 files |
| Total | 2,163 |
### Training

- Base: deadbydawn101/gemma-4-E4B-mlx-4bit
- Method: SFT, completions-only (`mlx_vlm.lora`)
- Rank: 8 · Alpha: 16 · LR: 1e-5 · Iters: 1,000
- Hardware: Apple M4 Max 128GB · Peak mem: 7.876 GB
- Final loss: ~3.5e-7
### Fusion

All 378 LoRA pairs merged via weight arithmetic:

```
W_merged = dequantize(W_base) + (A @ B).T × (alpha / rank)
```

The result is dequantized to bfloat16 and saved as 3-shard safetensors.
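The merge formula above, sketched in NumPy. Shapes follow the LoRA convention assumed here (`A: (in, rank)`, `B: (rank, out)`, base weight `(out, in)`), and `dequantize` is elided since the toy base weight is already float:

```python
import numpy as np

def merge_lora(w_base, A, B, alpha=16.0, rank=8):
    """W_merged = W_base + (A @ B).T * (alpha / rank)."""
    return w_base + (A @ B).T * (alpha / rank)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 6))   # toy (out, in) base weight
A = rng.standard_normal((6, 8))   # lora_a: (in, rank)
B = rng.standard_normal((8, 4))   # lora_b: (rank, out)

merged = merge_lora(w, A, B)
assert merged.shape == w.shape
# Zero adapters leave the base weights untouched:
assert np.allclose(merge_lora(w, np.zeros((6, 8)), np.zeros((8, 4))), w)
```

The fused model ships these merged matrices, which is why no `adapter_path` is needed at load time.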
## 🦙 Ollama — One-Command Setup

### Instant run (no install needed)

```bash
ollama run hf.co/deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit
```
### With a custom system prompt + tool support

Create a `Modelfile`:

```
FROM hf.co/deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit
SYSTEM "You are a helpful assistant with tool-use capabilities. Think through problems step by step."
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
```

```bash
ollama create ravenx-gemma4 -f Modelfile
ollama run ravenx-gemma4
```
### OpenAI-compatible endpoint

```bash
# Ollama exposes an OpenAI-compatible API automatically
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "hf.co/deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
### Run with mlx_lm server (native, faster on Apple Silicon)

```bash
# mlx_lm server uses the Metal GPU directly and is faster than Ollama on Apple Silicon
mlx_lm.server --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --port 8080

# Then use any OpenAI client
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit", "messages": [{"role": "user", "content": "Hello!"}]}'
```
## Related models
| Model | Size | Notes |
|---|---|---|
| gemma-4-E4B-mlx-4bit | 4.86 GB | Base model (4-bit, use with adapter) |
| gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit | ~10.5 GB | This model — fused, no adapter needed |
| gemma-4-E4B-opus-reasoning-claude-code-lora | 658 MB | Adapter-only |
| gemma-4-E2B-Heretic-Uncensored-mlx-4bit | 3.34 GB | 2B abliterated |
| gemma-4-21b-REAP-Tool-Calling-mlx-4bit | 12 GB | 21B MoE REAP |
## License
## TriAttention KV Compression
[2026-04-09] Our MLX port was merged into TriAttention (MIT + NVIDIA) — PR #1 by @DeadByDawn101 (RavenX AI).
Apply a 10.7x KV memory reduction and a 2.5x throughput gain on top of this model's built-in 4-bit TurboQuant quantization, for ~50x combined compression vs. full fp16:
```python
from mlx_lm import load
from triattention.mlx import apply_triattention_mlx

model, tokenizer = load("deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit")
apply_triattention_mlx(model, kv_budget=2048)
```
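One reading of the "~50x" figure: multiply TurboQuant's 4.6x cache compression (quoted above) by TriAttention's 10.7x reduction:

```python
turboquant = 4.6     # TurboQuant KV cache compression ratio
triattention = 10.7  # TriAttention KV memory reduction
combined = turboquant * triattention
print(f"{combined:.1f}x")  # → 49.2x, i.e. roughly 50x vs an uncompressed fp16 cache
```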
## RavenX Inference Harness
One-command inference, benchmarking, and local OpenAI-compatible server:
```bash
git clone https://github.com/DeadByDawn101/ravenx-inference-harness
cd ravenx-inference-harness

# Inference
python run.py --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --prompt "Your prompt"

# TriAttention compressed
python run.py --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --triattention --kv-budget 2048

# Local OpenAI-compatible server (works with OpenClaw)
python serve.py --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --triattention
```
## Model tree

Base model: google/gemma-4-E4B-it