Gemma 4 E4B - GGUF
This repository contains quantized GGUF weights for Google's Gemma 4 E4B.
The weights were converted with a recent master build of llama.cpp to ensure full architectural compatibility and fast inference on modern hardware. The "E" stands for *effective* parameters: the model uses Per-Layer Embeddings (PLE) to reach the reasoning quality of a larger model while activating only 4.5B effective parameters (out of 8B total) during generation, which makes it well suited to local execution on laptops and mobile edge devices.
📦 Available Quantization Formats
Below are multiple quantization "flavors" to best match your hardware capabilities.
| File Name | Bit-Rate | Size | Target VRAM / RAM | Description |
|---|---|---|---|---|
| `gemma-4-e4b-Q8_0.gguf` | 8-bit | ~8.5 GB | 12 GB+ | Purest quality, zero noticeable logic loss. |
| `gemma-4-e4b-Q6_K.gguf` | 6-bit | ~6.5 GB | 8 GB+ | Near-perfect reasoning retention. Fits easily on 8 GB GPUs. |
| `gemma-4-e4b-Q5_K_M.gguf` | 5-bit | ~5.7 GB | 8 GB+ | High precision, ideal for coding and complex math. |
| `gemma-4-e4b-Q4_K_M.gguf` | 4-bit | ~4.8 GB | 6 GB+ | **Recommended.** The sweet spot for 6 GB/8 GB laptops and MacBooks. |
Note: Context windows (e.g., 8K or the maximum 128K) will require additional VRAM allocation for the KV Cache.
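To get a feel for how much extra memory the KV cache needs, you can estimate it as two tensors (K and V) per layer, each `n_ctx × n_kv_heads × head_dim` elements. The sketch below uses hypothetical layer/head counts for illustration only; they are not the published Gemma 4 E4B hyperparameters, so substitute the values reported by llama.cpp at load time.

```python
# Rough KV-cache size estimate: 2 tensors (K and V) per layer,
# each of shape [n_ctx, n_kv_heads * head_dim], stored in f16 (2 bytes).
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical config values for illustration only (NOT the real
# Gemma 4 E4B hyperparameters -- check the model metadata):
n_layers, n_kv_heads, head_dim = 30, 8, 128

for ctx in (8_192, 131_072):  # 8K and the maximum 128K context
    gib = kv_cache_bytes(ctx, n_layers, n_kv_heads, head_dim) / 2**30
    print(f"{ctx:>7} tokens -> ~{gib:.1f} GiB KV cache")
```

Because the estimate is linear in context length, a 128K window costs 16× the memory of an 8K window; llama.cpp can also quantize the KV cache (e.g. `--cache-type-k q8_0`) to reduce this.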
🚀 How to Use
These GGUF files are designed to be run locally using llama.cpp or any compatible downstream UI (e.g., LM Studio, Ollama, or custom Gradio WebUIs).
💻 Command Line (llama.cpp)
To run the recommended Q4_K_M model with full GPU offloading and an 8K context window:
```bash
./llama-cli -m gemma-4-e4b-Q4_K_M.gguf -n 2048 -c 8192 -ngl 999 -p "You are an expert AI assistant. Explain quantum entanglement."
```
🐍 Python (llama-cpp-python)
```python
from llama_cpp import Llama

# Load the model with Flash Attention enabled
llm = Llama(
    model_path="./gemma-4-e4b-Q4_K_M.gguf",
    n_gpu_layers=-1,   # Offload all layers to the GPU
    n_ctx=8192,        # 8K context window
    flash_attn=True,
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a Python script to calculate the Fibonacci sequence."},
    ]
)
print(response["choices"][0]["message"]["content"])
```
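If you use llama.cpp's lower-level completion API instead of `create_chat_completion`, you must apply the chat template yourself. Below is a minimal sketch of the Gemma-family turn format (`<start_of_turn>…<end_of_turn>`), assumed here to carry over from earlier Gemma releases; verify it against the template embedded in this GGUF's metadata before relying on it.

```python
def gemma_prompt(messages):
    """Render a chat messages list in the Gemma-family turn format.

    Gemma templates have no dedicated system role, so a system
    message is folded into the first user turn.
    """
    system = ""
    parts = []
    for m in messages:
        if m["role"] == "system":
            system = m["content"].strip() + "\n\n"
            continue
        role = "model" if m["role"] == "assistant" else "user"
        content = m["content"]
        if role == "user" and system:
            content, system = system + content, ""
        parts.append(f"<start_of_turn>{role}\n{content}<end_of_turn>\n")
    # Leave the prompt open at the model turn so generation continues it.
    parts.append("<start_of_turn>model\n")
    return "".join(parts)

print(gemma_prompt([{"role": "user", "content": "Hello!"}]))
```

In practice, prefer `create_chat_completion`, which reads the template from the GGUF metadata automatically; manual templating is only needed for raw `llm(prompt)` calls.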
⚖️ License & Acknowledgements
These weights are derivative works of Google's Gemma 4 E4B model. They are distributed under the Apache 2.0 License. All credit for the underlying neural architecture and base training data goes to the Google DeepMind team.