Gemma 4 E2B - GGUF

This repository contains highly optimized GGUF format quantized weights for Google's Gemma 4 E2B.

These weights were converted with a recent master build of llama.cpp to ensure full architectural compatibility and fast inference on modern hardware. The "E" stands for effective parameters: the model uses Per-Layer Embeddings (PLE) to achieve the reasoning depth of a larger model while activating only 2.3B effective parameters (out of 5.1B total) during generation. It is the most lightweight model in the Gemma 4 family, engineered for fast local execution on mobile devices, edge hardware, and entry-level laptops.
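As a rough illustration of what "effective parameters" means for memory planning, the arithmetic below uses the 2.3B / 5.1B figures from the description; treating the entire remainder as PLE weights that can live outside fast GPU memory is an illustrative assumption, not a published architecture breakdown.

```python
# Back-of-envelope split between effective and per-layer-embedding (PLE)
# parameters. Figures come from the model description above; the idea
# that all non-effective parameters can stay in slower memory is an
# illustrative simplification.
total_params = 5.1e9
effective_params = 2.3e9  # parameters active during generation

ple_params = total_params - effective_params
active_fraction = effective_params / total_params

print(f"PLE / offloadable parameters: {ple_params / 1e9:.1f}B")
print(f"Active per generated token:   {active_fraction:.0%}")
```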

πŸ“¦ Available Quantization Formats

Several quantization "flavors" are provided to match your hardware capabilities.

| File Name | Bit-Rate | Size | Target VRAM / RAM | Description |
|---|---|---|---|---|
| gemma-4-e2b-Q8_0.gguf | 8-bit | ~5.4 GB | 8 GB+ | Highest quality; near-lossless. |
| gemma-4-e2b-Q6_K.gguf | 6-bit | ~4.2 GB | 6 GB+ | Near-perfect reasoning retention; fits easily on 6 GB GPUs. |
| gemma-4-e2b-Q5_K_M.gguf | 5-bit | ~3.7 GB | 6 GB+ | High precision, ideal for edge devices. |
| gemma-4-e2b-Q4_K_M.gguf | 4-bit | ~3.1 GB | 4 GB+ | Recommended. The sweet spot for 4 GB GPUs (like the RTX 3050). |
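The file sizes in the table follow directly from parameter count times bits per weight. A minimal sketch; the bits-per-weight values below are approximate averages (K-quants store block scales alongside the weights, so the effective bit-rate exceeds the nominal one):

```python
# Approximate GGUF file size: parameters Γ— bits-per-weight / 8.
# The bits-per-weight figures are rough averages per quant type,
# not exact values for this particular file.
PARAMS = 5.1e9  # total parameters, from the model description

BPW = {"Q8_0": 8.5, "Q6_K": 6.56, "Q5_K_M": 5.69, "Q4_K_M": 4.85}

for quant, bpw in BPW.items():
    size_gb = PARAMS * bpw / 8 / 1e9
    print(f"{quant}: ~{size_gb:.1f} GB")
```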

Note: The context window (e.g., 8K, up to the maximum 128K) requires additional VRAM for the KV cache, on top of the file sizes above.
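The KV-cache overhead mentioned in the note can be estimated as 2 (K and V) Γ— layers Γ— context length Γ— KV heads Γ— head dimension Γ— bytes per element. The layer and head dimensions below are placeholders for illustration; this card does not list the real Gemma 4 E2B dimensions.

```python
def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_el=2):
    """Size of the KV cache: a K and a V tensor for every layer and position."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_el

# Hypothetical dimensions, for illustration only (FP16 cache).
gb = kv_cache_bytes(n_layers=30, n_ctx=8192, n_kv_heads=8, head_dim=128) / 1e9
print(f"8K context: ~{gb:.2f} GB of extra VRAM")
```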

πŸš€ How to Use

These GGUF files are designed to be run locally using llama.cpp or any compatible downstream UI (e.g., LM Studio, Ollama, or custom Gradio WebUIs).

πŸ’» Command Line (llama.cpp)

To run the recommended Q4_K_M model with full GPU offloading and an 8K context window:

```shell
# -ngl 999 offloads all layers to the GPU; -c sets the context window,
# -n the maximum number of tokens to generate.
./llama-cli -m gemma-4-e2b-Q4_K_M.gguf -n 2048 -c 8192 -ngl 999 \
    -p "You are an expert AI assistant. Explain quantum entanglement."
```
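Note that -p passes a raw prompt; instruction-tuned Gemma checkpoints are trained on a turn-based template. A sketch of that template as used by earlier Gemma releases, assuming Gemma 4 keeps the same markers (this card does not confirm it):

```python
def gemma_prompt(user_message: str) -> str:
    """Wrap a user message in the Gemma turn template.

    Uses the <start_of_turn>/<end_of_turn> markers from earlier Gemma
    releases; whether Gemma 4 keeps them is an assumption here.
    """
    return (
        "<start_of_turn>user\n"
        f"{user_message}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

print(gemma_prompt("Explain quantum entanglement."))
```

Recent llama.cpp builds can also apply the chat template stored in the GGUF metadata automatically when run in conversation mode, which avoids hand-building prompts like this.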

🐍 Python (llama-cpp-python)

```python
from llama_cpp import Llama

# Load the model with Flash Attention enabled (requires a build and GPU
# that support it; remove flash_attn=True otherwise).
llm = Llama(
    model_path="./gemma-4-e2b-Q4_K_M.gguf",
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_ctx=8192,       # 8K context window
    flash_attn=True,
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a Python script to calculate the Fibonacci sequence."},
    ]
)

print(response["choices"][0]["message"]["content"])
```
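For the Ollama route mentioned above, a minimal Modelfile sketch; the path assumes the GGUF sits in the working directory, and the context-length parameter mirrors the 8K examples:

```
FROM ./gemma-4-e2b-Q4_K_M.gguf
PARAMETER num_ctx 8192
```

Build and run it with `ollama create gemma-4-e2b -f Modelfile` followed by `ollama run gemma-4-e2b`.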

βš–οΈ License & Acknowledgements

These weights are derivative works of Google's Gemma 4 E2B model. They are distributed under the Apache 2.0 License. All credit for the underlying neural architecture and base training data goes to the Google DeepMind team.

Downloads last month: 348
Format: GGUF
Model size: 5B params
Architecture: gemma4