Gemma 4 E4B - GGUF
This repository contains quantized GGUF weights for Google's Gemma 4 E4B.
The weights were converted with a recent master build of llama.cpp to ensure full architectural compatibility and fast inference on modern hardware. The "E" stands for *effective* parameters: the model uses Per-Layer Embeddings (PLE) to reach the reasoning quality of a larger model while activating only 4.5B effective parameters (out of 8B total) during generation, which makes it well suited to local execution on laptops and mobile edge devices.
📦 Available Quantization Formats
Below are multiple quantization "flavors" to best match your hardware capabilities.
| File Name | Bit-Rate | Size | Target VRAM / RAM | Description |
|---|---|---|---|---|
| `gemma-4-e4b-Q8_0.gguf` | 8-bit | ~8.5 GB | 12 GB+ | Purest quality, zero noticeable logic loss. |
| `gemma-4-e4b-Q6_K.gguf` | 6-bit | ~6.5 GB | 8 GB+ | Near-perfect reasoning retention. Fits easily on 8 GB GPUs. |
| `gemma-4-e4b-Q5_K_M.gguf` | 5-bit | ~5.7 GB | 8 GB+ | High precision, ideal for coding and complex math. |
| `gemma-4-e4b-Q4_K_M.gguf` | 4-bit | ~4.8 GB | 6 GB+ | **Recommended.** The sweet spot for 6 GB/8 GB laptops and MacBooks. |
Note: Context windows (e.g., 8K or the maximum 128K) will require additional VRAM allocation for the KV Cache.
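To get a feel for how much extra memory the KV cache needs, you can estimate it as two tensors (K and V) per layer, each `n_ctx × n_kv_heads × head_dim` elements. The sketch below uses hypothetical layer/head counts for illustration only; they are not the published Gemma 4 E4B hyperparameters, so substitute the values reported by llama.cpp at load time.

```python
# Rough KV-cache size estimate: 2 tensors (K and V) per layer,
# each of shape [n_ctx, n_kv_heads * head_dim], stored in f16 (2 bytes).
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical config values for illustration only (NOT the real
# Gemma 4 E4B hyperparameters -- check the model metadata):
n_layers, n_kv_heads, head_dim = 30, 8, 128

for ctx in (8_192, 131_072):  # 8K and the maximum 128K context
    gib = kv_cache_bytes(ctx, n_layers, n_kv_heads, head_dim) / 2**30
    print(f"{ctx:>7} tokens -> ~{gib:.1f} GiB KV cache")
```

Because the estimate is linear in context length, a 128K window costs 16× the memory of an 8K window; llama.cpp can also quantize the KV cache (e.g. `--cache-type-k q8_0`) to reduce this.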
🚀 How to Use
These GGUF files are designed to be run locally using llama.cpp or any compatible downstream UI (e.g., LM Studio, Ollama, or custom Gradio WebUIs).
💻 Command Line (llama.cpp)
To run the recommended Q4_K_M model with full GPU offloading and an 8K context window:
```bash
./llama-cli -m gemma-4-e4b-Q4_K_M.gguf -n 2048 -c 8192 -ngl 999 -p "You are an expert AI assistant. Explain quantum entanglement."
```
🐍 Python (llama-cpp-python)
```python
from llama_cpp import Llama

# Load the model with Flash Attention enabled
llm = Llama(
    model_path="./gemma-4-e4b-Q4_K_M.gguf",
    n_gpu_layers=-1,   # Offload all layers to the GPU
    n_ctx=8192,        # 8K context window
    flash_attn=True,
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a Python script to calculate the Fibonacci sequence."},
    ]
)
print(response["choices"][0]["message"]["content"])
```
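If you use llama.cpp's lower-level completion API instead of `create_chat_completion`, you must apply the chat template yourself. Below is a minimal sketch of the Gemma-family turn format (`<start_of_turn>…<end_of_turn>`), assumed here to carry over from earlier Gemma releases; verify it against the template embedded in this GGUF's metadata before relying on it.

```python
def gemma_prompt(messages):
    """Render a chat messages list in the Gemma-family turn format.

    Gemma templates have no dedicated system role, so a system
    message is folded into the first user turn.
    """
    system = ""
    parts = []
    for m in messages:
        if m["role"] == "system":
            system = m["content"].strip() + "\n\n"
            continue
        role = "model" if m["role"] == "assistant" else "user"
        content = m["content"]
        if role == "user" and system:
            content, system = system + content, ""
        parts.append(f"<start_of_turn>{role}\n{content}<end_of_turn>\n")
    # Leave the prompt open at the model turn so generation continues it.
    parts.append("<start_of_turn>model\n")
    return "".join(parts)

print(gemma_prompt([{"role": "user", "content": "Hello!"}]))
```

In practice, prefer `create_chat_completion`, which reads the template from the GGUF metadata automatically; manual templating is only needed for raw `llm(prompt)` calls.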
⚖️ License & Acknowledgements
These weights are derivative works of Google's Gemma 4 E4B model. They are distributed under the Apache 2.0 License. All credit for the underlying neural architecture and base training data goes to the Google DeepMind team.