# Gemma 4 31B - GGUF
This repository contains GGUF-format quantized weights for Google's Gemma 4 31B.

These weights were quantized with the current master branch of llama.cpp to ensure full architectural compatibility, including high-speed inference on modern hardware such as NVIDIA Blackwell GPUs (RTX 50-series) and Apple Silicon.
## 📦 Available Quantization Formats
Below are multiple quantization "flavors" to best match your hardware capabilities.
| File Name | Bit-Rate | Size | Target VRAM / RAM | Description |
|---|---|---|---|---|
| `gemma-4-31b-Q8_0.gguf` | 8-bit | ~32.1 GB | 36 GB+ | Highest quality; no noticeable quality loss. |
| `gemma-4-31b-Q6_K.gguf` | 6-bit | ~25.5 GB | 28 GB+ | Near-perfect reasoning retention; fits comfortably on 32 GB GPUs. |
| `gemma-4-31b-Q5_K_M.gguf` | 5-bit | ~22.5 GB | 24 GB+ | High precision; ideal for coding and complex math. |
| `gemma-4-31b-Q4_K_M.gguf` | 4-bit | ~18.2 GB | 20 GB+ | **Recommended.** The sweet spot for 24 GB GPUs. |
Note: Context windows (e.g., 32K or the maximum 256K) will require additional VRAM allocation for the KV Cache.
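To get a feel for how much extra VRAM the KV cache needs, you can estimate it from the model's attention dimensions. The sketch below uses the standard formula (2 tensors per layer, K and V, at fp16); the layer, KV-head, and head-dimension values are **hypothetical placeholders for illustration only** — read the real values from the GGUF metadata, which llama.cpp prints at load time:

```python
def kv_cache_bytes(ctx_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate KV cache size: K and V tensors per layer,
    each of shape [n_kv_heads, ctx_len, head_dim]."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Hypothetical dimensions for illustration -- NOT confirmed for Gemma 4 31B.
# Check the values llama.cpp reports (n_layer, n_head_kv, n_embd_head).
size = kv_cache_bytes(ctx_len=32768, n_layers=48, n_kv_heads=8, head_dim=128)
print(f"~{size / 2**30:.1f} GiB of KV cache at 32K context (fp16)")
```

Quantized KV caches (e.g. `--cache-type-k q8_0` in llama.cpp) roughly halve this figure, at a small quality cost.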
## 🚀 How to Use
These GGUF files are designed to be run locally using llama.cpp or any compatible downstream UI (e.g., LM Studio, Ollama, or custom Gradio WebUIs).
### 💻 Command Line (llama.cpp)
To run the recommended Q4_K_M model with full GPU offloading and a 32K context window:
```bash
./llama-cli -m gemma-4-31b-Q4_K_M.gguf \
  -n 2048 -c 32768 -ngl 999 \
  -p "You are an expert AI assistant. Explain quantum entanglement."
```
### 🐍 Python (llama-cpp-python)
```python
from llama_cpp import Llama

# Load the model with flash attention enabled
llm = Llama(
    model_path="./gemma-4-31b-Q4_K_M.gguf",
    n_gpu_layers=-1,   # offload all layers to the GPU
    n_ctx=32768,       # 32K context window
    flash_attn=True,
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a python script to calculate the Fibonacci sequence."},
    ]
)
print(response["choices"][0]["message"]["content"])
```
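The chat API above applies the model's chat template for you. If you instead pass a raw `-p` prompt to `llama-cli`, you need to format turns yourself. A minimal sketch, **assuming** Gemma 4 keeps the `<start_of_turn>`/`<end_of_turn>` template of earlier Gemma releases (verify against the tokenizer metadata in the GGUF):

```python
def gemma_prompt(user_message, system_message=None):
    # Earlier Gemma models have no separate system role; the system
    # message is folded into the first user turn. Assumed here for Gemma 4.
    content = f"{system_message}\n\n{user_message}" if system_message else user_message
    return (
        f"<start_of_turn>user\n{content}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

prompt = gemma_prompt(
    "Explain quantum entanglement.",
    system_message="You are an expert AI assistant.",
)
print(prompt)
```

The resulting string can be passed directly as the `-p` argument, or to `llm(prompt, ...)` for raw completion.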
## ⚖️ License & Acknowledgements
These weights are derivative works of Google's Gemma 4 31B model. They are distributed under the Apache 2.0 License. All credit for the underlying neural architecture and base training data goes to the Google DeepMind team.