Gemma 4 E2B - GGUF

This repository contains highly optimized GGUF format quantized weights for Google's Gemma 4 E2B.

These weights were converted with a recent master build of llama.cpp to ensure full architectural compatibility and fast inference on modern hardware. The "E" stands for effective parameters: the model uses Per-Layer Embeddings (PLE) to achieve the reasoning depth of a larger model while activating only 2.3B effective parameters (out of 5.1B total) during generation. It is the most lightweight model in the Gemma 4 family, engineered for fast local execution on mobile devices, edge hardware, and entry-level laptops.
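As a rough illustration of what "effective parameters" means for memory planning, the arithmetic below uses the 2.3B / 5.1B figures from the description; treating the entire remainder as PLE weights that can live outside fast GPU memory is an illustrative assumption, not a published architecture breakdown.

```python
# Back-of-envelope split between effective and per-layer-embedding (PLE)
# parameters. Figures come from the model description above; the idea
# that all non-effective parameters can stay in slower memory is an
# illustrative simplification.
total_params = 5.1e9
effective_params = 2.3e9  # parameters active during generation

ple_params = total_params - effective_params
active_fraction = effective_params / total_params

print(f"PLE / offloadable parameters: {ple_params / 1e9:.1f}B")
print(f"Active per generated token:   {active_fraction:.0%}")
```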

πŸ“¦ Available Quantization Formats

Several quantization "flavors" are provided to match your hardware capabilities.

| File Name | Bit-Rate | Size | Target VRAM / RAM | Description |
|---|---|---|---|---|
| gemma-4-e2b-Q8_0.gguf | 8-bit | ~5.4 GB | 8 GB+ | Highest quality; near-lossless. |
| gemma-4-e2b-Q6_K.gguf | 6-bit | ~4.2 GB | 6 GB+ | Near-perfect reasoning retention; fits easily on 6 GB GPUs. |
| gemma-4-e2b-Q5_K_M.gguf | 5-bit | ~3.7 GB | 6 GB+ | High precision, ideal for edge devices. |
| gemma-4-e2b-Q4_K_M.gguf | 4-bit | ~3.1 GB | 4 GB+ | Recommended. The sweet spot for 4 GB GPUs (like the RTX 3050). |
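The file sizes in the table follow directly from parameter count times bits per weight. A minimal sketch; the bits-per-weight values below are approximate averages (K-quants store block scales alongside the weights, so the effective bit-rate exceeds the nominal one):

```python
# Approximate GGUF file size: parameters Γ— bits-per-weight / 8.
# The bits-per-weight figures are rough averages per quant type,
# not exact values for this particular file.
PARAMS = 5.1e9  # total parameters, from the model description

BPW = {"Q8_0": 8.5, "Q6_K": 6.56, "Q5_K_M": 5.69, "Q4_K_M": 4.85}

for quant, bpw in BPW.items():
    size_gb = PARAMS * bpw / 8 / 1e9
    print(f"{quant}: ~{size_gb:.1f} GB")
```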

Note: The context window (e.g., 8K, up to the maximum 128K) requires additional VRAM for the KV cache, on top of the file sizes above.
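The KV-cache overhead mentioned in the note can be estimated as 2 (K and V) Γ— layers Γ— context length Γ— KV heads Γ— head dimension Γ— bytes per element. The layer and head dimensions below are placeholders for illustration; this card does not list the real Gemma 4 E2B dimensions.

```python
def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_el=2):
    """Size of the KV cache: a K and a V tensor for every layer and position."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_el

# Hypothetical dimensions, for illustration only (FP16 cache).
gb = kv_cache_bytes(n_layers=30, n_ctx=8192, n_kv_heads=8, head_dim=128) / 1e9
print(f"8K context: ~{gb:.2f} GB of extra VRAM")
```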

πŸš€ How to Use

These GGUF files are designed to be run locally using llama.cpp or any compatible downstream UI (e.g., LM Studio, Ollama, or custom Gradio WebUIs).

πŸ’» Command Line (llama.cpp)

To run the recommended Q4_K_M model with full GPU offloading and an 8K context window:

```shell
# -ngl 999 offloads all layers to the GPU; -c sets the context window,
# -n the maximum number of tokens to generate.
./llama-cli -m gemma-4-e2b-Q4_K_M.gguf -n 2048 -c 8192 -ngl 999 \
    -p "You are an expert AI assistant. Explain quantum entanglement."
```
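Note that -p passes a raw prompt; instruction-tuned Gemma checkpoints are trained on a turn-based template. A sketch of that template as used by earlier Gemma releases, assuming Gemma 4 keeps the same markers (this card does not confirm it):

```python
def gemma_prompt(user_message: str) -> str:
    """Wrap a user message in the Gemma turn template.

    Uses the <start_of_turn>/<end_of_turn> markers from earlier Gemma
    releases; whether Gemma 4 keeps them is an assumption here.
    """
    return (
        "<start_of_turn>user\n"
        f"{user_message}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

print(gemma_prompt("Explain quantum entanglement."))
```

Recent llama.cpp builds can also apply the chat template stored in the GGUF metadata automatically when run in conversation mode, which avoids hand-building prompts like this.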

🐍 Python (llama-cpp-python)

```python
from llama_cpp import Llama

# Load the model with Flash Attention enabled (requires a build and GPU
# that support it; remove flash_attn=True otherwise).
llm = Llama(
    model_path="./gemma-4-e2b-Q4_K_M.gguf",
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_ctx=8192,       # 8K context window
    flash_attn=True,
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a Python script to calculate the Fibonacci sequence."},
    ]
)

print(response["choices"][0]["message"]["content"])
```
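For the Ollama route mentioned above, a minimal Modelfile sketch; the path assumes the GGUF sits in the working directory, and the context-length parameter mirrors the 8K examples:

```
FROM ./gemma-4-e2b-Q4_K_M.gguf
PARAMETER num_ctx 8192
```

Build and run it with `ollama create gemma-4-e2b -f Modelfile` followed by `ollama run gemma-4-e2b`.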

βš–οΈ License & Acknowledgements

These weights are derivative works of Google's Gemma 4 E2B model. They are distributed under the Apache 2.0 License. All credit for the underlying neural architecture and base training data goes to the Google DeepMind team.

Downloads last month: 348
Format: GGUF
Model size: 5B params
Architecture: gemma4